[jira] [Commented] (ARROW-18123) [Python] Cannot use multi-byte characters in file names
[ https://issues.apache.org/jira/browse/ARROW-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623177#comment-17623177 ]

SHIMA Tatsuya commented on ARROW-18123:
---------------------------------------

Thanks for your comment. But does that explain why relative paths can be used as long as they do not contain multi-byte characters? The sample code appears to use relative paths. Also, the documentation I am looking at does not seem to link to that detailed explanation.
[https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html]

> [Python] Cannot use multi-byte characters in file names
> -------------------------------------------------------
>
> Key: ARROW-18123
> URL: https://issues.apache.org/jira/browse/ARROW-18123
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 9.0.0
> Reporter: SHIMA Tatsuya
> Priority: Major
>
> An error occurs when a file path containing multi-byte characters, e.g. {{例.parquet}}, is passed to {{pyarrow.parquet.write_table}}:
>
> {code:python}
> Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
> ...                    'two': ['foo', 'bar', 'baz'],
> ...                    'three': [True, False, True]},
> ...                   index=list('abc'))
> >>> table = pa.Table.from_pandas(df)
> >>> import pyarrow.parquet as pq
> >>> pq.write_table(table, '例.parquet')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 2920, in write_table
>     with ParquetWriter(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 911, in __init__
>     filesystem, path = _resolve_filesystem_and_path(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line 184, in _resolve_filesystem_and_path
>     filesystem, path = FileSystem.from_uri(path)
>   File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (ARROW-18123) [Python] Cannot use multi-byte characters in file names
SHIMA Tatsuya created ARROW-18123:
-------------------------------------

             Summary: [Python] Cannot use multi-byte characters in file names
                 Key: ARROW-18123
                 URL: https://issues.apache.org/jira/browse/ARROW-18123
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 9.0.0
            Reporter: SHIMA Tatsuya

An error occurs when a file path containing multi-byte characters, e.g. {{例.parquet}}, is passed to {{pyarrow.parquet.write_table}}:

{code:python}
Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> import pyarrow as pa
>>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
...                    'two': ['foo', 'bar', 'baz'],
...                    'three': [True, False, True]},
...                   index=list('abc'))
>>> table = pa.Table.from_pandas(df)
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, '例.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 2920, in write_table
    with ParquetWriter(
  File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 911, in __init__
    filesystem, path = _resolve_filesystem_and_path(
  File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line 184, in _resolve_filesystem_and_path
    filesystem, path = FileSystem.from_uri(path)
  File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
{code}
[jira] [Updated] (ARROW-17737) [R] Groups before conversion to a Table must not be restored after `collect()`
[ https://issues.apache.org/jira/browse/ARROW-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

SHIMA Tatsuya updated ARROW-17737:
----------------------------------
    Summary: [R] Groups before conversion to a Table must not be restored after `collect()`  (was: [R] Continue to retain grouping metadata even if ungroup arrow dplyr query)

> [R] Groups before conversion to a Table must not be restored after `collect()`
> ------------------------------------------------------------------------------
>
> Key: ARROW-17737
> URL: https://issues.apache.org/jira/browse/ARROW-17737
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: SHIMA Tatsuya
> Assignee: SHIMA Tatsuya
> Priority: Major
> Labels: pull-request-available
> Time Spent: 50m
> Remaining Estimate: 0h
>
> Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it becomes an arrow dplyr query. And it must also be written back again when converted to a Table.
>
> {code:r}
> mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars()
> #> character(0)
> mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars()
> #> [1] "cyl"
> {code}
>
> {code:r}
> mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::group_by(vs, .add = TRUE) |> dplyr::ungroup() |> dplyr::collect()
> #> # A tibble: 32 × 11
> #> # Groups:   cyl [3]
> #>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
> #>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> #>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
> #>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
> #>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
> #>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
> #>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
> #>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
> #>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
> #>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
> #>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
> #> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
> #> # … with 22 more rows
> {code}
[jira] [Resolved] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option
[ https://issues.apache.org/jira/browse/ARROW-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

SHIMA Tatsuya resolved ARROW-17429.
-----------------------------------
    Resolution: Fixed

Seems fixed by ARROW-17355

> [R] Error messages are not helpful of read_csv_arrow with col_types option
> --------------------------------------------------------------------------
>
> Key: ARROW-17429
> URL: https://issues.apache.org/jira/browse/ARROW-17429
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: SHIMA Tatsuya
> Assignee: SHIMA Tatsuya
> Priority: Major
> Labels: pull-request-available
> Time Spent: 50m
> Remaining Estimate: 0h
>
> The error message displayed when a non-convertible type is specified does not seem to be helpful in the development version.
>
> {code:r}
> tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
> csv_file <- tempfile()
> on.exit(unlink(csv_file))
> write.csv(tbl, csv_file, row.names = FALSE)
> arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <dttm>
> #> 1 1970-01-01 00:00:00
> arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <chr>
> #> 1 1970-01-01T12:00:00+12:00
> arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> {code}
>
> In arrow 9.0.0:
>
> {code:r}
> tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
> csv_file <- tempfile()
> on.exit(unlink(csv_file))
> write.csv(tbl, csv_file, row.names = FALSE)
> arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <dttm>
> #> 1 1970-01-01 00:00:00
> arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <chr>
> #> 1 1970-01-01T12:00:00+12:00
> arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
> #> Error:
> #> ! Invalid: In CSV column #0: CSV conversion error to int32: invalid value '1970-01-01T12:00:00+12:00'
> arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
> #> Error:
> #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[ns]: expected no zone offset in '1970-01-01T12:00:00+12:00'
> {code}
[jira] [Assigned] (ARROW-17737) [R] Continue to retain grouping metadata even if ungroup arrow dplyr query
[ https://issues.apache.org/jira/browse/ARROW-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

SHIMA Tatsuya reassigned ARROW-17737:
-------------------------------------
    Assignee: SHIMA Tatsuya

> [R] Continue to retain grouping metadata even if ungroup arrow dplyr query
> --------------------------------------------------------------------------
>
> Key: ARROW-17737
> URL: https://issues.apache.org/jira/browse/ARROW-17737
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: SHIMA Tatsuya
> Assignee: SHIMA Tatsuya
> Priority: Major
>
> Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it becomes an arrow dplyr query. And it must also be written back again when converted to a Table.
>
> {code:r}
> mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars()
> #> character(0)
> mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars()
> #> [1] "cyl"
> {code}
>
> {code:r}
> mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::group_by(vs, .add = TRUE) |> dplyr::ungroup() |> dplyr::collect()
> #> # A tibble: 32 × 11
> #> # Groups:   cyl [3]
> #>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
> #>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> #>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
> #>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
> #>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
> #>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
> #>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
> #>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
> #>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
> #>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
> #>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
> #> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
> #> # … with 22 more rows
> {code}
[jira] [Updated] (ARROW-17737) [R] Continue to retain grouping metadata even if ungroup arrow dplyr query
[ https://issues.apache.org/jira/browse/ARROW-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

SHIMA Tatsuya updated ARROW-17737:
----------------------------------
    Description: 
Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it becomes an arrow dplyr query. And it must also be written back again when converted to a Table.

{code:r}
mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars()
#> character(0)
mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars()
#> [1] "cyl"
{code}

{code:r}
mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::group_by(vs, .add = TRUE) |> dplyr::ungroup() |> dplyr::collect()
#> # A tibble: 32 × 11
#> # Groups:   cyl [3]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows
{code}

  was:
Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it becomes an arrow dplyr query. And it must also be written back again when converted to a Table.

{code:r}
mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars()
#> character(0)
mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars()
#> [1] "cyl"
{code}

> [R] Continue to retain grouping metadata even if ungroup arrow dplyr query
> --------------------------------------------------------------------------
>
> Key: ARROW-17737
> URL: https://issues.apache.org/jira/browse/ARROW-17737
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: SHIMA Tatsuya
> Priority: Major
>
> Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it becomes an arrow dplyr query. And it must also be written back again when converted to a Table.
>
> {code:r}
> mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars()
> #> character(0)
> mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars()
> #> [1] "cyl"
> {code}
>
> {code:r}
> mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::group_by(vs, .add = TRUE) |> dplyr::ungroup() |> dplyr::collect()
> #> # A tibble: 32 × 11
> #> # Groups:   cyl [3]
> #>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
> #>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> #>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
> #>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
> #>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
> #>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
> #>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
> #>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
> #>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
> #>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
> #>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
> #> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
> #> # … with 22 more rows
> {code}
[jira] [Commented] (ARROW-17738) [R] dplyr::compute should convert from grouped arrow_dplyr_query to arrow Table
[ https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606158#comment-17606158 ]

SHIMA Tatsuya commented on ARROW-17738:
---------------------------------------

I think it is confusing to users that {{compute}} does not produce a Table as intended when grouping remains after {{summarise}} etc. is executed.

{code:r}
mtcars |> arrow::arrow_table() |> dplyr::group_by(vs, am) |> dplyr::summarise(wt = mean(wt)) |> dplyr::compute()
#> Table (query)
#> vs: double
#> am: double
#> wt: double
#>
#> * Grouped by vs
#> See $.data for the source Arrow object
{code}

> [R] dplyr::compute should convert from grouped arrow_dplyr_query to arrow Table
> -------------------------------------------------------------------------------
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: SHIMA Tatsuya
> Assignee: SHIMA Tatsuya
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries and does not result in a Table.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> {{as_arrow_table()}} works fine.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
[jira] [Updated] (ARROW-17738) [R] dplyr::compute should convert from grouped arrow_dplyr_query to arrow Table
[ https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

SHIMA Tatsuya updated ARROW-17738:
----------------------------------
    Summary: [R] dplyr::compute should convert from grouped arrow_dplyr_query to arrow Table  (was: [R] dplyr::compute converts from grouped arrow_dplyr_query to arrow Table)

> [R] dplyr::compute should convert from grouped arrow_dplyr_query to arrow Table
> -------------------------------------------------------------------------------
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: SHIMA Tatsuya
> Assignee: SHIMA Tatsuya
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries and does not result in a Table.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> {{as_arrow_table()}} works fine.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
[jira] [Updated] (ARROW-17738) [R] dplyr::compute converts from grouped arrow_dplyr_query to arrow Table
[ https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

SHIMA Tatsuya updated ARROW-17738:
----------------------------------
    Summary: [R] dplyr::compute converts from grouped arrow_dplyr_query to arrow Table  (was: [R] dplyr::compute does not work for grouped arrow dplyr query)

> [R] dplyr::compute converts from grouped arrow_dplyr_query to arrow Table
> -------------------------------------------------------------------------
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: SHIMA Tatsuya
> Assignee: SHIMA Tatsuya
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries and does not result in a Table.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> {{as_arrow_table()}} works fine.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
[jira] [Commented] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query
[ https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606134#comment-17606134 ]

SHIMA Tatsuya commented on ARROW-17738:
---------------------------------------

Ah, is this the intended behavior? I don't understand why it would be intended; I think {{compute}} should return a Table here, just as dbplyr and dtplyr do.

> [R] dplyr::compute does not work for grouped arrow dplyr query
> --------------------------------------------------------------
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: SHIMA Tatsuya
> Assignee: SHIMA Tatsuya
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries and does not result in a Table.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> {{as_arrow_table()}} works fine.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
[jira] [Assigned] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query
[ https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

SHIMA Tatsuya reassigned ARROW-17738:
-------------------------------------
    Assignee: SHIMA Tatsuya

> [R] dplyr::compute does not work for grouped arrow dplyr query
> --------------------------------------------------------------
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: SHIMA Tatsuya
> Assignee: SHIMA Tatsuya
> Priority: Major
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries and does not result in a Table.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> {{as_arrow_table()}} works fine.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
[jira] [Assigned] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query
[ https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

SHIMA Tatsuya reassigned ARROW-17738:
-------------------------------------
    Assignee:     (was: SHIMA Tatsuya)

> [R] dplyr::compute does not work for grouped arrow dplyr query
> --------------------------------------------------------------
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: SHIMA Tatsuya
> Priority: Major
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries and does not result in a Table.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> {{as_arrow_table()}} works fine.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
[jira] [Assigned] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query
[ https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

SHIMA Tatsuya reassigned ARROW-17738:
-------------------------------------
    Assignee: SHIMA Tatsuya

> [R] dplyr::compute does not work for grouped arrow dplyr query
> --------------------------------------------------------------
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: SHIMA Tatsuya
> Assignee: SHIMA Tatsuya
> Priority: Major
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries and does not result in a Table.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> {{as_arrow_table()}} works fine.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
[jira] [Commented] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query
[ https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606116#comment-17606116 ]

SHIMA Tatsuya commented on ARROW-17738:
---------------------------------------

I have updated the description. Grouped arrow dplyr queries are not converted to tables by {{dplyr::compute}}.

> [R] dplyr::compute does not work for grouped arrow dplyr query
> --------------------------------------------------------------
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: SHIMA Tatsuya
> Priority: Major
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries and does not result in a Table.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> {{as_arrow_table()}} works fine.
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
>
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
[jira] [Updated] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query
[ https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17738: -- Description: It is expected that {{dplyr::compute()}} will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries and does not result in a Table. {code:r} mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class() #> [1] "arrow_dplyr_query" mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> dplyr::compute() |> class() #> [1] "Table""ArrowTabular" "ArrowObject" "R6" {code} {{as_arrow_table()}} works fine. {code:r} mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class() #> [1] "arrow_dplyr_query" mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class() #> [1] "arrow_dplyr_query" mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class() #> [1] "arrow_dplyr_query" mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class() #> [1] "Table""ArrowTabular" "ArrowObject" "R6" {code} It seems to revert to arrow dplyr query in the following line. [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75] was: It is expected that {{dplyr::compute()}} will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries and does not result in a Table. {code:r} mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class() #> [1] "arrow_dplyr_query" mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> dplyr::compute() |> class() #> [1] "Table""ArrowTabular" "ArrowObject" "R6" {code} {{as_arrow_table()}} works fine. 
{code:r}
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class()
#> [1] "Table" "ArrowTabular" "ArrowObject" "R6"
{code}

It seems to revert to an arrow dplyr query at the following lines.
[https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]

It is expected that dplyr::compute() will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries, and the result is not a Table.

> [R] dplyr::compute does not work for grouped arrow dplyr query
> --
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: SHIMA Tatsuya
> Priority: Major
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries and does not result in a Table.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> dplyr::compute() |> class()
> #> [1] "Table" "ArrowTabular" "ArrowObject" "R6"
> {code}
> {{as_arrow_table()}} works fine.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class()
> #> [1] "Table" "ArrowTabular" "ArrowObject" "R6"
> {code}
> It seems to revert to an arrow dplyr query at the following lines.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query
[ https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17738:
--
Description:
It is expected that {{dplyr::compute()}} will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries and does not result in a Table.

{code:r}
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> dplyr::compute() |> class()
#> [1] "Table" "ArrowTabular" "ArrowObject" "R6"
{code}

{{as_arrow_table()}} works fine.

{code:r}
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class()
#> [1] "Table" "ArrowTabular" "ArrowObject" "R6"
{code}

It seems to revert to an arrow dplyr query at the following lines.
[https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]

It is expected that dplyr::compute() will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries, and the result is not a Table.

was:
{code:r}
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class()
#> [1] "Table" "ArrowTabular" "ArrowObject" "R6"
{code}
It seems to revert to an arrow dplyr query at the following lines.
https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75

> [R] dplyr::compute does not work for grouped arrow dplyr query
> --
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: SHIMA Tatsuya
> Priority: Major
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries and does not result in a Table.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> dplyr::compute() |> class()
> #> [1] "Table" "ArrowTabular" "ArrowObject" "R6"
> {code}
> {{as_arrow_table()}} works fine.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class()
> #> [1] "Table" "ArrowTabular" "ArrowObject" "R6"
> {code}
> It seems to revert to an arrow dplyr query at the following lines.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
>
> It is expected that dplyr::compute() will perform the calculation on the arrow dplyr query and convert it to a Table, but it does not seem to work correctly for grouped arrow dplyr queries, and the result is not a Table.
[jira] [Updated] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query
[ https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17738:
--
Issue Type: Bug (was: Improvement)

> [R] dplyr::compute does not work for grouped arrow dplyr query
> --
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: SHIMA Tatsuya
> Priority: Major
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class()
> #> [1] "Table" "ArrowTabular" "ArrowObject" "R6"
> {code}
> It seems to revert to an arrow dplyr query at the following lines.
> https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75
[jira] [Updated] (ARROW-17737) [R] Continue to retain grouping metadata even if ungroup arrow dplyr query
[ https://issues.apache.org/jira/browse/ARROW-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17737: -- Issue Type: Bug (was: Improvement) > [R] Continue to retain grouping metadata even if ungroup arrow dplyr query > -- > > Key: ARROW-17737 > URL: https://issues.apache.org/jira/browse/ARROW-17737 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it > becomes an arrow dplyr query. > And it must also be written back again when converted to a Table. > {code:r} > mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> > as.data.frame() |> dplyr::group_vars() > #> character(0) > mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> > as.data.frame() |> dplyr::group_vars() > #> [1] "cyl" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query
SHIMA Tatsuya created ARROW-17738:
--
Summary: [R] dplyr::compute does not work for grouped arrow dplyr query
Key: ARROW-17738
URL: https://issues.apache.org/jira/browse/ARROW-17738
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya

{code:r}
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> arrow::as_arrow_table() |> class()
#> [1] "Table" "ArrowTabular" "ArrowObject" "R6"
{code}

It seems to revert to an arrow dplyr query at the following lines.
https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75
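As the examples in this report show, {{as_arrow_table()}} can serve as a workaround until {{compute()}} handles grouped queries (a sketch based only on the behavior demonstrated above):

{code:r}
# Workaround sketch: as_arrow_table() materializes even a grouped
# arrow dplyr query into a Table, where compute() currently does not.
grouped_tbl <- mtcars |>
  arrow::arrow_table() |>
  dplyr::group_by(cyl) |>
  arrow::as_arrow_table()
class(grouped_tbl)
#> [1] "Table" "ArrowTabular" "ArrowObject" "R6"
{code}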
[jira] [Updated] (ARROW-17737) [R] Continue to retain grouping metadata even if ungroup arrow dplyr query
[ https://issues.apache.org/jira/browse/ARROW-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17737: -- Description: Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it becomes an arrow dplyr query. And it must also be written back again when converted to a Table. {code:r} mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars() #> character(0) mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars() #> [1] "cyl" {code} was: Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it becomes arrow dplyr query. And it must also be written back again when converted to Table. {code:r} mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars() #> character(0) mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars() #> [1] "cyl" {code} > [R] Continue to retain grouping metadata even if ungroup arrow dplyr query > -- > > Key: ARROW-17737 > URL: https://issues.apache.org/jira/browse/ARROW-17737 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it > becomes an arrow dplyr query. > And it must also be written back again when converted to a Table. > {code:r} > mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> > as.data.frame() |> dplyr::group_vars() > #> character(0) > mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> > as.data.frame() |> dplyr::group_vars() > #> [1] "cyl" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17737) [R] Continue to retain grouping metadata even if ungroup arrow dplyr query
[ https://issues.apache.org/jira/browse/ARROW-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17737: -- Description: Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it becomes arrow dplyr query. And it must also be written back again when converted to Table. {code:r} mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars() #> character(0) mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars() #> [1] "cyl" {code} was: Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it becomes arrow dplyr query. {code:r} mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars() #> character(0) mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars() #> [1] "cyl" {code} > [R] Continue to retain grouping metadata even if ungroup arrow dplyr query > -- > > Key: ARROW-17737 > URL: https://issues.apache.org/jira/browse/ARROW-17737 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it > becomes arrow dplyr query. > And it must also be written back again when converted to Table. > {code:r} > mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> > as.data.frame() |> dplyr::group_vars() > #> character(0) > mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> > as.data.frame() |> dplyr::group_vars() > #> [1] "cyl" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17737) [R] Continue to retain grouping metadata even if ungroup arrow dplyr query
SHIMA Tatsuya created ARROW-17737:
--
Summary: [R] Continue to retain grouping metadata even if ungroup arrow dplyr query
Key: ARROW-17737
URL: https://issues.apache.org/jira/browse/ARROW-17737
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya

Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it becomes an arrow dplyr query.

{code:r}
mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars()
#> character(0)
mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> as.data.frame() |> dplyr::group_vars()
#> [1] "cyl"
{code}
[jira] [Resolved] (ARROW-17727) [R] Implement dplyr::across() inside group_by()
[ https://issues.apache.org/jira/browse/ARROW-17727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya resolved ARROW-17727. --- Resolution: Duplicate I'm sorry, I created this without noticing ARROW-17689. > [R] Implement dplyr::across() inside group_by() > --- > > Key: ARROW-17727 > URL: https://issues.apache.org/jira/browse/ARROW-17727 > Project: Apache Arrow > Issue Type: Improvement >Affects Versions: 10.0.0 >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17689) [R] Implement dplyr::across() inside group_by()
[ https://issues.apache.org/jira/browse/ARROW-17689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya reassigned ARROW-17689: - Assignee: SHIMA Tatsuya > [R] Implement dplyr::across() inside group_by() > --- > > Key: ARROW-17689 > URL: https://issues.apache.org/jira/browse/ARROW-17689 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: SHIMA Tatsuya >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17727) [R] Implement dplyr::across() inside group_by()
SHIMA Tatsuya created ARROW-17727: - Summary: [R] Implement dplyr::across() inside group_by() Key: ARROW-17727 URL: https://issues.apache.org/jira/browse/ARROW-17727 Project: Apache Arrow Issue Type: Improvement Affects Versions: 10.0.0 Reporter: SHIMA Tatsuya Assignee: SHIMA Tatsuya -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17724) [R] Allow package name prefix inside dplyr::across's .fns argument
SHIMA Tatsuya created ARROW-17724:
--
Summary: [R] Allow package name prefix inside dplyr::across's .fns argument
Key: ARROW-17724
URL: https://issues.apache.org/jira/browse/ARROW-17724
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 10.0.0
Reporter: SHIMA Tatsuya

This is not a major issue, but may be worth mentioning as a known limitation.

{code:r}
library(dplyr, warn.conflicts = FALSE)

mtcars |> arrow::arrow_table() |> mutate(across(starts_with("c"), base::as.character)) |> collect()
#> Error in base(cyl): could not find function "base"
{code}
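Until prefixed functions are supported inside {{.fns}}, a possible workaround suggested by the error above is to pass the function without the package prefix (a sketch; it assumes the bare name is not masked by another package):

{code:r}
library(dplyr, warn.conflicts = FALSE)

# Workaround sketch: the unprefixed function name avoids the failed
# translation of the `base::` prefix inside across()'s .fns argument.
mtcars |>
  arrow::arrow_table() |>
  mutate(across(starts_with("c"), as.character)) |>
  collect()
{code}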
[jira] [Assigned] (ARROW-17416) [R] Implement lubridate::with_tz and lubridate::force_tz
[ https://issues.apache.org/jira/browse/ARROW-17416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya reassigned ARROW-17416: - Assignee: SHIMA Tatsuya > [R] Implement lubridate::with_tz and lubridate::force_tz > > > Key: ARROW-17416 > URL: https://issues.apache.org/jira/browse/ARROW-17416 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17674) [R] Implement dplyr::across() inside arrange()
SHIMA Tatsuya created ARROW-17674: - Summary: [R] Implement dplyr::across() inside arrange() Key: ARROW-17674 URL: https://issues.apache.org/jira/browse/ARROW-17674 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 9.0.0 Reporter: SHIMA Tatsuya Assignee: SHIMA Tatsuya -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17673) [R] `desc` in `dplyr::arrange` should allow `dplyr::` prefix
[ https://issues.apache.org/jira/browse/ARROW-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17673: -- Summary: [R] `desc` in `dplyr::arrange` should allow `dplyr::` prefix (was: [R] desc in dplyr::arrange should allow dplyr:: prefix) > [R] `desc` in `dplyr::arrange` should allow `dplyr::` prefix > > > Key: ARROW-17673 > URL: https://issues.apache.org/jira/browse/ARROW-17673 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > > This example works. > {code:r} > mtcars |> arrow::arrow_table() |> dplyr::arrange(desc(cyl)) |> > dplyr::collect() > {code} > But next one is not supported now. > {code:r} > mtcars |> arrow::arrow_table() |> dplyr::arrange(dplyr::desc(cyl)) |> > dplyr::collect() > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17673) [R] desc in dplyr::arrange should allow dplyr:: prefix
SHIMA Tatsuya created ARROW-17673:
--
Summary: [R] desc in dplyr::arrange should allow dplyr:: prefix
Key: ARROW-17673
URL: https://issues.apache.org/jira/browse/ARROW-17673
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya
Assignee: SHIMA Tatsuya

This example works.

{code:r}
mtcars |> arrow::arrow_table() |> dplyr::arrange(desc(cyl)) |> dplyr::collect()
{code}

But the next one is currently not supported.

{code:r}
mtcars |> arrow::arrow_table() |> dplyr::arrange(dplyr::desc(cyl)) |> dplyr::collect()
{code}
[jira] [Commented] (ARROW-17432) [R] messed up rows when importing large csv into parquet
[ https://issues.apache.org/jira/browse/ARROW-17432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584903#comment-17584903 ] SHIMA Tatsuya commented on ARROW-17432: --- Hi, how about passing the schema to the {{col_types}} argument? {code:r} csv_stream <- open_dataset(csv_file, format = "csv", col_types = sch) {code} Or, using {{readr::read_csv()}}? I also wonder if the number of rows in the dataset fetched is the same in all cases. > [R] messed up rows when importing large csv into parquet > > > Key: ARROW-17432 > URL: https://issues.apache.org/jira/browse/ARROW-17432 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 8.0.0, 9.0.0 > Environment: R version 4.2.1 > Running in Arch Linux - EndeavourOS > arrow_info() > Arrow package version: 9.0.0 > Capabilities: > > datasetTRUE > substrait FALSE > parquetTRUE > json TRUE > s3 TRUE > gcsTRUE > utf8proc TRUE > re2TRUE > snappy TRUE > gzip TRUE > brotli TRUE > zstd TRUE > lz4TRUE > lz4_frame TRUE > lzo FALSE > bz2TRUE > jemalloc TRUE > mimalloc TRUE > Memory: > > Allocator jemalloc > Current 49.31 Kb > Max1.63 Mb > Runtime: > > SIMD Level avx2 > Detected SIMD Level avx2 > Build: > > C++ Library Version 9.0.0 > C++ Compiler GNU > C++ Compiler Version 7.5.0 > > print(pa.__version__) > 9.0.0 >Reporter: Guillermo Duran >Priority: Major > > This is a weird issue that creates new rows when importing a large csv (56 > GB) into parquet in R. It occurred with both R Arrow 8.0.0 and 9.0.0 BUT > didn't occur with the Python Arrow library 9.0.0. Due to the large size of > the original csv it's difficult to create a reproducible example, but I share > the code and outputs. 
> The code I use in R to import the csv: > {code:java} > library(arrow) > library(dplyr) > > csv_file <- "/ebird_erd2021/full/obs.csv" > dest <- "/ebird_erd2021/full/obs_parquet/" > sch = arrow::schema(checklist_id = float32(), > species_code = string(), > exotic_category = float32(), > obs_count = float32(), > only_presence_reported = float32(), > only_slash_reported = float32(), > valid = float32(), > reviewed = float32(), > has_media = float32() > ) > csv_stream <- open_dataset(csv_file, format = "csv", > schema = sch, skip_rows = 1) > write_dataset(csv_stream, dest, format = "parquet", > max_rows_per_file=100L, > hive_style = TRUE, > existing_data_behavior = "overwrite"){code} > When I load the dataset and check one random _checklist_id_ I get rows that > are not part of the _obs.csv_ file. There shouldn't be duplicated species in > a checklist but there are ({_}amerob{_} for example)... also note that the > duplicated species have different {_}obs_count{_}. 50 species in total in > that specific {_}checklist_id{_}. 
> {code:java} > parquet_arrow <- open_dataset(dest, format = "parquet") > parquet_arrow |> > filter(checklist_id == 18543372) |> > arrange(species_code) |> > collect() > # A tibble: 50 × 3 >checklist_id species_code obs_count > > 1 18543372 altori 3 > 2 18543372 amekes 1 > 3 18543372 amered 40 > 4 18543372 amerob 30 > 5 18543372 amerob 9 > 6 18543372 balori 9 > 7 18543372 blkter 9 > 8 18543372 blkvul 20 > 9 18543372 buggna 1 > 10 18543372 buwwar 1 > # … with 40 more rows > # ℹ Use `print(n = ...)` to see more rows{code} > If I use awk to query the csv file with that same checklist id, I get > something different: > {code:java} > $ awk -F "," '{ if ($1 == 18543372) { print } }' obs.csv > 18543372.0,rewbla,,60.0,0.0,0.0,1.0,0.0,0.0 > 18543372.0,amerob,,30.0,0.0,0.0,1.0,0.0,0.0 > 18543372.0,robgro,,2.0,0.0,0.0,1.0,0.0,0.0 > 18543372.0,eastow,,1.0,0.0,0.0,1.0,0.0,0.0 > 18543372.0,sedwre1,,2.0,0.0,0.0,1.0,0.0,0.0 > 18543372.0,ovenbi1,,1.0,0.0,0.0,1.0,0.0,0.0 > 18543372.0,buggna,,1.0,0.0,0.0,1.0,0.0,0.0 > 18543372.0,reshaw,,1.0,0.0,0.0,1.0,0.0,0.0 > 18543372.0,turvul,,1.0,0.0,0.0,1.0,0.0,0.0 > 18543372.0,gowwar,,1.0,0.0,0.0,1.0,0.0,0.0 > 18543372.0,balori,,9.0,0.0,0.0,1.0,0.0,0.0 > 18543372.0,buwwar,,1.0,0.0,0.0,1.0,0.0,0.0 > 18543372.0,grycat,,1.0,0.0,0.0,1.0,0.0,0.0 > 18543372.0,cangoo,,6.0,0.0,0.0,1.0,0.0,0.
[jira] [Commented] (ARROW-17439) [R] pull() should compute() not collect()
[ https://issues.apache.org/jira/browse/ARROW-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583872#comment-17583872 ] SHIMA Tatsuya commented on ARROW-17439: --- Since {{pull()}} can have additional arguments, why not add an argument that controls whether it should return an arrow structure or an R vector, like the {{as_data_frame}} argument that {{read_csv_arrow()}} and others have? > [R] pull() should compute() not collect() > - > > Key: ARROW-17439 > URL: https://issues.apache.org/jira/browse/ARROW-17439 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Bryce Mecum >Priority: Major > Labels: good-first-issue > > Currently {{pull()}} returns an R vector, but it's the only dplyr verb other > than {{collect()}} that returns an R data structure. And there's no other > natural way to extract a ChunkedArray from the result of an arrow query. -- This message was sent by Atlassian Jira (v8.20.10#820010)
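The suggestion in the comment above might look like the following (the argument name {{as_vector}} is hypothetical here, borrowed from the {{as_data_frame}} pattern in {{read_csv_arrow()}}; this is not an existing arrow interface):

{code:r}
# Hypothetical API sketch -- argument name and default are assumptions.
mtcars |> arrow::arrow_table() |> dplyr::pull(cyl)
# current behavior: an R vector

mtcars |> arrow::arrow_table() |> dplyr::pull(cyl, as_vector = FALSE)
# proposed behavior: would return a ChunkedArray instead
{code}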
[jira] [Closed] (ARROW-15734) [R][Docs] Enable searching R docs
[ https://issues.apache.org/jira/browse/ARROW-15734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya closed ARROW-15734. - Resolution: Won't Fix > [R][Docs] Enable searching R docs > - > > Key: ARROW-15734 > URL: https://issues.apache.org/jira/browse/ARROW-15734 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: SHIMA Tatsuya >Priority: Major > Labels: pull-request-available > Attachments: bs5.png, fixed-bs5.png, > image-2022-03-01-00-33-12-050.png, image-2022-03-01-00-46-51-350.png, > updated-list.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Enable Bootstrap 5 in pkgdown website to use the built-in search feature. > Do you have any plans to switch to Bootstrap 5? > https://pkgdown.r-lib.org/articles/search.html > https://pkgdown.r-lib.org/articles/customise.html -- This message was sent by Atlassian Jira (v8.20.10#820010)
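For reference, opting a pkgdown site into Bootstrap 5 (which enables the built-in search discussed in this issue) is a small change in the site's {{_pkgdown.yml}}:

{code}
# _pkgdown.yml
template:
  bootstrap: 5
{code}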
[jira] [Updated] (ARROW-17485) [R] Allow TRUE/FALSE to the compression option of `write_feather` (`write_ipc_file`)
[ https://issues.apache.org/jira/browse/ARROW-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17485: -- Summary: [R] Allow TRUE/FALSE to the compression option of `write_feather` (`write_ipc_file`) (was: [R] Allow TRUE/FALSE to the compression option of `write_feather`(`write_ipc_file`)) > [R] Allow TRUE/FALSE to the compression option of `write_feather` > (`write_ipc_file`) > > > Key: ARROW-17485 > URL: https://issues.apache.org/jira/browse/ARROW-17485 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > We may want to create an uncompressed IPC file to share with JavaScript. > https://quarto.org/docs/interactive/ojs/data-sources.html#file-attachments > Currently, to do this, we need to set up the following, but the string > "uncompressed" is long and does not benefit from auto-completion by the IDE, > making it difficult to write code. > {code:r} > arrow::write_feather(mtcars, "data.arrow", compression = "uncompressed") > {code} > It would be useful to write the following. > {code:r} > arrow::write_feather(mtcars, "data.arrow", compression = FALSE) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17485) [R] Allow TRUE/FALSE to the compression option of `write_feather`(`write_ipc_file`)
[ https://issues.apache.org/jira/browse/ARROW-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17485: -- Summary: [R] Allow TRUE/FALSE to the compression option of `write_feather`(`write_ipc_file`) (was: [R] Allow TRUE/FALSE to the compression option of `write_feather`) > [R] Allow TRUE/FALSE to the compression option of > `write_feather`(`write_ipc_file`) > --- > > Key: ARROW-17485 > URL: https://issues.apache.org/jira/browse/ARROW-17485 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > We may want to create an uncompressed IPC file to share with JavaScript. > https://quarto.org/docs/interactive/ojs/data-sources.html#file-attachments > Currently, to do this, we need to set up the following, but the string > "uncompressed" is long and does not benefit from auto-completion by the IDE, > making it difficult to write code. > {code:r} > arrow::write_feather(mtcars, "data.arrow", compression = "uncompressed") > {code} > It would be useful to write the following. > {code:r} > arrow::write_feather(mtcars, "data.arrow", compression = FALSE) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17485) [R] Allow TRUE/FALSE to the compression option of `write_feather`
[ https://issues.apache.org/jira/browse/ARROW-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17485: -- Summary: [R] Allow TRUE/FALSE to the compression option of `write_feather` (was: [R] Allow FALSE to the compression option of `write_feather`) > [R] Allow TRUE/FALSE to the compression option of `write_feather` > - > > Key: ARROW-17485 > URL: https://issues.apache.org/jira/browse/ARROW-17485 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > > We may want to create an uncompressed IPC file to share with JavaScript. > https://quarto.org/docs/interactive/ojs/data-sources.html#file-attachments > Currently, to do this, we need to set up the following, but the string > "uncompressed" is long and does not benefit from auto-completion by the IDE, > making it difficult to write code. > {code:r} > arrow::write_feather(mtcars, "data.arrow", compression = "uncompressed") > {code} > It would be useful to write the following. > {code:r} > arrow::write_feather(mtcars, "data.arrow", compression = FALSE) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17485) [R] Allow FALSE to the compression option of `write_feather`
[ https://issues.apache.org/jira/browse/ARROW-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya reassigned ARROW-17485: - Assignee: SHIMA Tatsuya > [R] Allow FALSE to the compression option of `write_feather` > > > Key: ARROW-17485 > URL: https://issues.apache.org/jira/browse/ARROW-17485 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > > We may want to create an uncompressed IPC file to share with JavaScript. > https://quarto.org/docs/interactive/ojs/data-sources.html#file-attachments > Currently, to do this, we need to set up the following, but the string > "uncompressed" is long and does not benefit from auto-completion by the IDE, > making it difficult to write code. > {code:r} > arrow::write_feather(mtcars, "data.arrow", compression = "uncompressed") > {code} > It would be useful to write the following. > {code:r} > arrow::write_feather(mtcars, "data.arrow", compression = FALSE) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17485) [R] Allow FALSE to the compression option of `write_feather`
SHIMA Tatsuya created ARROW-17485: - Summary: [R] Allow FALSE to the compression option of `write_feather` Key: ARROW-17485 URL: https://issues.apache.org/jira/browse/ARROW-17485 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 9.0.0 Reporter: SHIMA Tatsuya We may want to create an uncompressed IPC file to share with JavaScript. https://quarto.org/docs/interactive/ojs/data-sources.html#file-attachments Currently, to do this, we need to set up the following, but the string "uncompressed" is long and does not benefit from auto-completion by the IDE, making it difficult to write code. {code:r} arrow::write_feather(mtcars, "data.arrow", compression = "uncompressed") {code} It would be useful to write the following. {code:r} arrow::write_feather(mtcars, "data.arrow", compression = FALSE) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
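A minimal sketch of how the proposed TRUE/FALSE values could be normalized onto the existing string options before being passed along (the codec chosen for TRUE is an assumption, not arrow's documented default):

{code:r}
# Sketch only: map a logical compression argument onto the string values
# write_feather() already accepts; "lz4" for TRUE is an assumed default.
normalize_compression <- function(compression) {
  if (isFALSE(compression)) {
    "uncompressed"
  } else if (isTRUE(compression)) {
    "lz4"
  } else {
    compression  # pass strings such as "zstd" through unchanged
  }
}

arrow::write_feather(mtcars, "data.arrow",
                     compression = normalize_compression(FALSE))
{code}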
[jira] [Commented] (ARROW-17439) [R] pull() should compute() not collect()
[ https://issues.apache.org/jira/browse/ARROW-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581935#comment-17581935 ] SHIMA Tatsuya commented on ARROW-17439: --- Note that in dbplyr and dtplyr, pull returns a vector in R (as it should). https://dbplyr.tidyverse.org/reference/pull.tbl_sql.html I think this change in behavior makes sense, but may confuse users in its current state without references. > [R] pull() should compute() not collect() > - > > Key: ARROW-17439 > URL: https://issues.apache.org/jira/browse/ARROW-17439 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Bryce Mecum >Priority: Major > Labels: good-first-issue > > Currently {{pull()}} returns an R vector, but it's the only dplyr verb other > than {{collect()}} that returns an R data structure. And there's no other > natural way to extract a ChunkedArray from the result of an arrow query. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option
[ https://issues.apache.org/jira/browse/ARROW-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya reassigned ARROW-17429: - Assignee: SHIMA Tatsuya > [R] Error messages are not helpful of read_csv_arrow with col_types option > -- > > Key: ARROW-17429 > URL: https://issues.apache.org/jira/browse/ARROW-17429 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > > The error message displayed when a non-convertible type is specified does not > seem to help in the development version. > {code:r} > tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) > csv_file <- tempfile() > on.exit(unlink(csv_file)) > write.csv(tbl, csv_file, row.names = FALSE) > arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01 00:00:00 > arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01T12:00:00+12:00 > arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) > #> Error in as.data.frame(tab): object 'tab' not found > arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) > #> Error in as.data.frame(tab): object 'tab' not found > {code} > In arrow 9.0.0 > {code:r} > tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) > csv_file <- tempfile() > on.exit(unlink(csv_file)) > write.csv(tbl, csv_file, row.names = FALSE) > arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01 00:00:00 > arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01T12:00:00+12:00 > arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) > #> Error: > #> ! 
Invalid: In CSV column #0: CSV conversion error to int32: invalid value > '1970-01-01T12:00:00+12:00' > arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) > #> Error: > #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[ns]: > expected no zone offset in '1970-01-01T12:00:00+12:00' > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option
[ https://issues.apache.org/jira/browse/ARROW-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581850#comment-17581850 ] SHIMA Tatsuya commented on ARROW-17429: --- This issue appears to have been introduced by https://github.com/apache/arrow/pull/12826. > [R] Error messages are not helpful of read_csv_arrow with col_types option > -- > > Key: ARROW-17429 > URL: https://issues.apache.org/jira/browse/ARROW-17429 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: SHIMA Tatsuya >Priority: Major > > The error message displayed when a non-convertible type is specified does not > seem to help in the development version. > {code:r} > tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) > csv_file <- tempfile() > on.exit(unlink(csv_file)) > write.csv(tbl, csv_file, row.names = FALSE) > arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01 00:00:00 > arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01T12:00:00+12:00 > arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) > #> Error in as.data.frame(tab): object 'tab' not found > arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) > #> Error in as.data.frame(tab): object 'tab' not found > {code} > In arrow 9.0.0 > {code:r} > tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) > csv_file <- tempfile() > on.exit(unlink(csv_file)) > write.csv(tbl, csv_file, row.names = FALSE) > arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01 00:00:00 > arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01T12:00:00+12:00 > arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) > #> Error: > #> ! 
Invalid: In CSV column #0: CSV conversion error to int32: invalid value > '1970-01-01T12:00:00+12:00' > arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) > #> Error: > #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[ns]: > expected no zone offset in '1970-01-01T12:00:00+12:00' > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17425) [R] `lubridate::as_datetime()` in dplyr query should be able to handle time in sub seconds
[ https://issues.apache.org/jira/browse/ARROW-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17425: -- Description: Since the current unit is fixed to "s", an error will occur if a time containing sub-seconds is given. {code:r} "1970-01-01T00:00:59.123456789" |> arrow::arrow_table(x = _) |> dplyr::mutate(x = lubridate::as_datetime(x, tz = "UTC")) |> dplyr::collect() #> Error in `dplyr::collect()`: #> ! Invalid: Failed to parse string: '1970-01-01T00:00:59.123456789' as a scalar of type timestamp[s] {code} I thought that nanoseconds should be used, but it should be noted that POSIXct is currently supposed to be converted to microseconds, as shown in ARROW-17424. was: Since the current unit is fixed to "s", an error will occur if a time containing sub-seconds is given. {code:r} "1970-01-01T00:00:59.123456789" |> data.frame(x = _) |> arrow::arrow_table() |> dplyr::mutate(x = lubridate::as_datetime(x, tz = "UTC")) |> dplyr::collect() #> Error in `dplyr::collect()`: #> ! Invalid: Failed to parse string: '1970-01-01T00:00:59.123456789' as a scalar of type timestamp[s] {code} I thought that nanoseconds should be used, but it should be noted that POSIXct is currently supposed to be converted to microseconds, as shown in ARROW-17424. > [R] `lubridate::as_datetime()` in dplyr query should be able to handle time > in sub seconds > -- > > Key: ARROW-17425 > URL: https://issues.apache.org/jira/browse/ARROW-17425 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > Since the current unit is fixed to "s", an error will occur if a time > containing sub-seconds is given. 
> {code:r} > "1970-01-01T00:00:59.123456789" |> > arrow::arrow_table(x = _) |> > dplyr::mutate(x = lubridate::as_datetime(x, tz = "UTC")) |> > dplyr::collect() > #> Error in `dplyr::collect()`: > #> ! Invalid: Failed to parse string: '1970-01-01T00:00:59.123456789' as a > scalar of type timestamp[s] > {code} > I thought that nanoseconds should be used, but it should be noted that > POSIXct is currently supposed to be converted to microseconds, as shown in > ARROW-17424. > > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17414) [R]: Lack of `assume_timezone` binding
[ https://issues.apache.org/jira/browse/ARROW-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580336#comment-17580336 ] SHIMA Tatsuya commented on ARROW-17414: --- Yes. Perhaps if dplyr is not used, {{call_function}} should be used, and it would be great if that could be indicated in the error message as well. > [R]: Lack of `assume_timezone` binding > -- > > Key: ARROW-17414 > URL: https://issues.apache.org/jira/browse/ARROW-17414 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > If we run the following code in R, we will get a C++ derived error message > telling us to use {{assume_timezone}}. > However, this error message is not helpful because there is no binding for > the {{assume_timezone}} function in R. > {code:r} > tf <- tempfile() > writeLines("2004-04-01 12:00", tf) > arrow::read_csv_arrow(tf, schema = arrow::schema(col1 = arrow::timestamp("s", > "UTC"))) > #> Error: > #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[s, tz=UTC]: > expected a zone offset in '2004-04-01 12:00'. If these timestamps are in > local time, parse them as timestamps without timezone, then call > assume_timezone. > #> ℹ If you have supplied a schema and your data contains a header row, you > should supply the argument `skip = 1` to prevent the header being read in as > data. > {code} > It would be useful to improve the error message or to allow > {{assume_timezone}} to be used from R as well. > (although {{lubridate::with_tz()}} and {{lubridate::force_tz()}} could be > more useful within a dplyr query) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17424) [R] Microsecond is not sufficient unit for POSIXct
[ https://issues.apache.org/jira/browse/ARROW-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17424: -- Description: I believe the {{POSIXct}} type or R currently corresponds to the Arrow {{timestamp[us, tz=UTC]}} type. {code:r} lubridate::as_datetime(0) |> arrow::infer_type() #> Timestamp #> timestamp[us, tz=UTC] {code} {code:r} lubridate::as_datetime("1970-01-01 00:00:00.001") |> arrow::arrow_table(x = _) #> Table #> 1 rows x 1 columns #> $x {code} {code:r} df_a <- lubridate::as_datetime("1970-01-01 00:00:00.001") |> arrow::arrow_table(x = _) |> as.data.frame() df_b <- lubridate::as_datetime("1970-01-01 00:00:00.001") |> tibble::tibble(x = _) waldo::compare(df_a, df_b) #> `old$x`: "1970-01-01" #> `new$x`: "1970-01-01 00:00:00" {code} However, as shown below, POSIXct may hold data finer than a microsecond. {code:r} lubridate::as_datetime(0.1) |> as.numeric() #> [1] 1e-09 lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric() #> [1] 1.192093e-07 {code} I don't know why it is currently set in microseconds, but is there any reason not to set it in nanoseconds? was: I believe the {{POSIXct}} type or R currently corresponds to the Arrow {{timestamp[us, tz=UTC]}} type. {code:r} lubridate::as_datetime(0) |> arrow::infer_type() #> Timestamp #> timestamp[us, tz=UTC] {code} {code:r} df_a <- lubridate::as_datetime("1970-01-01 00:00:00.001") |> arrow::arrow_table(x = _) |> as.data.frame() df_b <- lubridate::as_datetime("1970-01-01 00:00:00.001") |> tibble::tibble(x = _) waldo::compare(df_a, df_b) #> `old$x`: "1970-01-01" #> `new$x`: "1970-01-01 00:00:00" {code} However, as shown below, POSIXct may hold data finer than a microsecond. {code:r} lubridate::as_datetime(0.1) |> as.numeric() #> [1] 1e-09 lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric() #> [1] 1.192093e-07 {code} I don't know why it is currently set in microseconds, but is there any reason not to set it in nanoseconds? 
> [R] Microsecond is not sufficient unit for POSIXct > -- > > Key: ARROW-17424 > URL: https://issues.apache.org/jira/browse/ARROW-17424 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > I believe the {{POSIXct}} type or R currently corresponds to the Arrow > {{timestamp[us, tz=UTC]}} type. > {code:r} > lubridate::as_datetime(0) |> arrow::infer_type() > #> Timestamp > #> timestamp[us, tz=UTC] > {code} > {code:r} > lubridate::as_datetime("1970-01-01 00:00:00.001") |> > arrow::arrow_table(x = _) > #> Table > #> 1 rows x 1 columns > #> $x > {code} > {code:r} > df_a <- lubridate::as_datetime("1970-01-01 00:00:00.001") |> > arrow::arrow_table(x = _) |> > as.data.frame() > df_b <- lubridate::as_datetime("1970-01-01 00:00:00.001") |> > tibble::tibble(x = _) > waldo::compare(df_a, df_b) > #> `old$x`: "1970-01-01" > #> `new$x`: "1970-01-01 00:00:00" > {code} > However, as shown below, POSIXct may hold data finer than a microsecond. > {code:r} > lubridate::as_datetime(0.1) |> as.numeric() > #> [1] 1e-09 > lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric() > #> [1] 1.192093e-07 > {code} > I don't know why it is currently set in microseconds, but is there any reason > not to set it in nanoseconds? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17424) [R] Microsecond is not sufficient unit for POSIXct
[ https://issues.apache.org/jira/browse/ARROW-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17424: -- Description: I believe the {{POSIXct}} type or R currently corresponds to the Arrow {{timestamp[us, tz=UTC]}} type. {code:r} lubridate::as_datetime(0) |> arrow::infer_type() #> Timestamp #> timestamp[us, tz=UTC] {code} {code:r} df_a <- lubridate::as_datetime("1970-01-01 00:00:00.001") |> arrow::arrow_table(x = _) |> as.data.frame() df_b <- lubridate::as_datetime("1970-01-01 00:00:00.001") |> tibble::tibble(x = _) waldo::compare(df_a, df_b) #> `old$x`: "1970-01-01" #> `new$x`: "1970-01-01 00:00:00" {code} However, as shown below, POSIXct may hold data finer than a microsecond. {code:r} lubridate::as_datetime(0.1) |> as.numeric() #> [1] 1e-09 lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric() #> [1] 1.192093e-07 {code} I don't know why it is currently set in microseconds, but is there any reason not to set it in nanoseconds? was: I believe the {{POSIXct}} type or R currently corresponds to the Arrow {{timestamp[us, tz=UTC]}} type. {code:r} lubridate::as_datetime(0) |> arrow::infer_type() #> Timestamp #> timestamp[us, tz=UTC] {code} However, as shown below, POSIXct may hold data finer than a microsecond. {code:r} lubridate::as_datetime(0.1) |> as.numeric() #> [1] 1e-09 lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric() #> [1] 1.192093e-07 {code} I don't know why it is currently set in microseconds, but is there any reason not to set it in nanoseconds? > [R] Microsecond is not sufficient unit for POSIXct > -- > > Key: ARROW-17424 > URL: https://issues.apache.org/jira/browse/ARROW-17424 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > I believe the {{POSIXct}} type or R currently corresponds to the Arrow > {{timestamp[us, tz=UTC]}} type. 
> {code:r} > lubridate::as_datetime(0) |> arrow::infer_type() > #> Timestamp > #> timestamp[us, tz=UTC] > {code} > {code:r} > df_a <- lubridate::as_datetime("1970-01-01 00:00:00.001") |> > arrow::arrow_table(x = _) |> > as.data.frame() > df_b <- lubridate::as_datetime("1970-01-01 00:00:00.001") |> > tibble::tibble(x = _) > waldo::compare(df_a, df_b) > #> `old$x`: "1970-01-01" > #> `new$x`: "1970-01-01 00:00:00" > {code} > However, as shown below, POSIXct may hold data finer than a microsecond. > {code:r} > lubridate::as_datetime(0.1) |> as.numeric() > #> [1] 1e-09 > lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric() > #> [1] 1.192093e-07 > {code} > I don't know why it is currently set in microseconds, but is there any reason > not to set it in nanoseconds? -- This message was sent by Atlassian Jira (v8.20.10#820010)
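The underlying reason POSIXct can carry sub-microsecond detail is that it stores seconds since the epoch in a C double, whose spacing depends on magnitude. A small stdlib sketch of that spacing near recent dates (illustrative, independent of arrow):

```python
import math

# Near ~1.7e9 epoch seconds (roughly the year 2023), the gap between
# adjacent doubles is about 2**-22 s: finer than one microsecond but
# coarser than one nanosecond. A fixed microsecond unit therefore
# cannot represent every distinct POSIXct value.
t = 1.7e9
spacing = math.ulp(t)   # distance from t to the next representable double
assert spacing < 1e-6   # sub-microsecond resolution at this magnitude
assert spacing > 1e-9   # ...but not full nanosecond resolution
```

This also shows why nanoseconds are not a perfect fit either: far from the epoch, a double cannot distinguish every nanosecond.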
[jira] [Updated] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option
[ https://issues.apache.org/jira/browse/ARROW-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17429: -- Description: The error message displayed when a non-convertible type is specified does not seem to help in the development version. {code:r} tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) csv_file <- tempfile() on.exit(unlink(csv_file)) write.csv(tbl, csv_file, row.names = FALSE) arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01 00:00:00 arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01T12:00:00+12:00 arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) #> Error in as.data.frame(tab): object 'tab' not found arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) #> Error in as.data.frame(tab): object 'tab' not found {code} was: The error message displayed when a non-convertible type is specified does not seem to help in the HEAD. 
{code:r} tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) csv_file <- tempfile() on.exit(unlink(csv_file)) write.csv(tbl, csv_file, row.names = FALSE) arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01 00:00:00 arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01T12:00:00+12:00 arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) #> Error in as.data.frame(tab): object 'tab' not found arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) #> Error in as.data.frame(tab): object 'tab' not found {code} > [R] Error messages are not helpful of read_csv_arrow with col_types option > -- > > Key: ARROW-17429 > URL: https://issues.apache.org/jira/browse/ARROW-17429 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > The error message displayed when a non-convertible type is specified does not > seem to help in the development version. > {code:r} > tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) > csv_file <- tempfile() > on.exit(unlink(csv_file)) > write.csv(tbl, csv_file, row.names = FALSE) > arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01 00:00:00 > arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01T12:00:00+12:00 > arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) > #> Error in as.data.frame(tab): object 'tab' not found > arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) > #> Error in as.data.frame(tab): object 'tab' not found > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option
[ https://issues.apache.org/jira/browse/ARROW-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17429: -- Affects Version/s: (was: 9.0.0) > [R] Error messages are not helpful of read_csv_arrow with col_types option > -- > > Key: ARROW-17429 > URL: https://issues.apache.org/jira/browse/ARROW-17429 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: SHIMA Tatsuya >Priority: Major > > The error message displayed when a non-convertible type is specified does not > seem to help in the development version. > {code:r} > tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) > csv_file <- tempfile() > on.exit(unlink(csv_file)) > write.csv(tbl, csv_file, row.names = FALSE) > arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01 00:00:00 > arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01T12:00:00+12:00 > arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) > #> Error in as.data.frame(tab): object 'tab' not found > arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) > #> Error in as.data.frame(tab): object 'tab' not found > {code} > In arrow 9.0.0 > {code:r} > tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) > csv_file <- tempfile() > on.exit(unlink(csv_file)) > write.csv(tbl, csv_file, row.names = FALSE) > arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01 00:00:00 > arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01T12:00:00+12:00 > arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) > #> Error: > #> ! 
Invalid: In CSV column #0: CSV conversion error to int32: invalid value > '1970-01-01T12:00:00+12:00' > arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) > #> Error: > #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[ns]: > expected no zone offset in '1970-01-01T12:00:00+12:00' > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option
[ https://issues.apache.org/jira/browse/ARROW-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17429: -- Description: The error message displayed when a non-convertible type is specified does not seem to help in the development version. {code:r} tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) csv_file <- tempfile() on.exit(unlink(csv_file)) write.csv(tbl, csv_file, row.names = FALSE) arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01 00:00:00 arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01T12:00:00+12:00 arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) #> Error in as.data.frame(tab): object 'tab' not found arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) #> Error in as.data.frame(tab): object 'tab' not found {code} In arrow 9.0.0 {code:r} tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) csv_file <- tempfile() on.exit(unlink(csv_file)) write.csv(tbl, csv_file, row.names = FALSE) arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01 00:00:00 arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01T12:00:00+12:00 arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) #> Error: #> ! Invalid: In CSV column #0: CSV conversion error to int32: invalid value '1970-01-01T12:00:00+12:00' arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) #> Error: #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[ns]: expected no zone offset in '1970-01-01T12:00:00+12:00' {code} was: The error message displayed when a non-convertible type is specified does not seem to help in the development version. 
{code:r} tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) csv_file <- tempfile() on.exit(unlink(csv_file)) write.csv(tbl, csv_file, row.names = FALSE) arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01 00:00:00 arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01T12:00:00+12:00 arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) #> Error in as.data.frame(tab): object 'tab' not found arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) #> Error in as.data.frame(tab): object 'tab' not found {code} > [R] Error messages are not helpful of read_csv_arrow with col_types option > -- > > Key: ARROW-17429 > URL: https://issues.apache.org/jira/browse/ARROW-17429 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > The error message displayed when a non-convertible type is specified does not > seem to help in the development version. 
> {code:r} > tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) > csv_file <- tempfile() > on.exit(unlink(csv_file)) > write.csv(tbl, csv_file, row.names = FALSE) > arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01 00:00:00 > arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01T12:00:00+12:00 > arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) > #> Error in as.data.frame(tab): object 'tab' not found > arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) > #> Error in as.data.frame(tab): object 'tab' not found > {code} > In arrow 9.0.0 > {code:r} > tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) > csv_file <- tempfile() > on.exit(unlink(csv_file)) > write.csv(tbl, csv_file, row.names = FALSE) > arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01 00:00:00 > arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01T12:00:00+12:00 > arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) > #> Error: > #> ! Invalid: In CSV column #0: CSV conversion error to int32: invalid value > '1970-01-01T12:00:00+12:00' > arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) > #> Error: > #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[ns]: > expected no zone offset in '1970-01-01T12:00:00+12:00' > {code}
[jira] [Updated] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option
[ https://issues.apache.org/jira/browse/ARROW-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17429: -- Description: The error message displayed when a non-convertible type is specified does not seem to help in the HEAD. {code:r} tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) csv_file <- tempfile() on.exit(unlink(csv_file)) write.csv(tbl, csv_file, row.names = FALSE) arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01 00:00:00 arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01T12:00:00+12:00 arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) #> Error in as.data.frame(tab): object 'tab' not found arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) #> Error in as.data.frame(tab): object 'tab' not found {code} was: The error message displayed when a non-convertible type is specified does not seem to help. 
{code:r} tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) csv_file <- tempfile() on.exit(unlink(csv_file)) write.csv(tbl, csv_file, row.names = FALSE) arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01 00:00:00 arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01T12:00:00+12:00 arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) #> Error in as.data.frame(tab): object 'tab' not found arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) #> Error in as.data.frame(tab): object 'tab' not found {code} > [R] Error messages are not helpful of read_csv_arrow with col_types option > -- > > Key: ARROW-17429 > URL: https://issues.apache.org/jira/browse/ARROW-17429 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > The error message displayed when a non-convertible type is specified does not > seem to help in the HEAD. > {code:r} > tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) > csv_file <- tempfile() > on.exit(unlink(csv_file)) > write.csv(tbl, csv_file, row.names = FALSE) > arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01 00:00:00 > arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) > #> # A tibble: 1 × 1 > #> x > #> > #> 1 1970-01-01T12:00:00+12:00 > arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) > #> Error in as.data.frame(tab): object 'tab' not found > arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) > #> Error in as.data.frame(tab): object 'tab' not found > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option
SHIMA Tatsuya created ARROW-17429: - Summary: [R] Error messages are not helpful of read_csv_arrow with col_types option Key: ARROW-17429 URL: https://issues.apache.org/jira/browse/ARROW-17429 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 9.0.0 Reporter: SHIMA Tatsuya The error message displayed when a non-convertible type is specified does not seem to help. {code:r} tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00")) csv_file <- tempfile() on.exit(unlink(csv_file)) write.csv(tbl, csv_file, row.names = FALSE) arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01 00:00:00 arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1) #> # A tibble: 1 × 1 #> x #> #> 1 1970-01-01T12:00:00+12:00 arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1) #> Error in as.data.frame(tab): object 'tab' not found arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1) #> Error in as.data.frame(tab): object 'tab' not found {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17428) [R] Implement as.integer and as.numeric for timestamp types etc. in Arrow dplyr queries
SHIMA Tatsuya created ARROW-17428: - Summary: [R] Implement as.integer and as.numeric for timestamp types etc. in Arrow dplyr queries Key: ARROW-17428 URL: https://issues.apache.org/jira/browse/ARROW-17428 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 9.0.0 Reporter: SHIMA Tatsuya In R, the POSIXct type is converted to seconds, so division and rounding are required within Arrow, depending on the unit. -- This message was sent by Atlassian Jira (v8.20.10#820010)
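The division-by-unit step this ticket asks for can be sketched in a few lines. Arrow timestamps are integers in a fixed unit while POSIXct is fractional seconds, so an `as.numeric()` binding would scale by the unit; the names below are illustrative, not arrow's implementation:

```python
# Scale factors from each Arrow timestamp unit to one second.
SCALES = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}

def timestamp_to_seconds(value, unit):
    """Hypothetical sketch: convert an integer timestamp in the given
    unit to POSIXct-style fractional seconds."""
    return value / SCALES[unit]

assert timestamp_to_seconds(1_500_000, "us") == 1.5
assert timestamp_to_seconds(59, "s") == 59.0
```

An `as.integer()` binding would additionally need a rounding policy (truncate vs. round) for sub-second remainders.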
[jira] [Commented] (ARROW-17414) [R]: Lack of `assume_timezone` binding
[ https://issues.apache.org/jira/browse/ARROW-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580063#comment-17580063 ] SHIMA Tatsuya commented on ARROW-17414: --- Thank you for pointing out how to call the compute functions. Ideally, we could add the with_tz function and update the error messages. > [R]: Lack of `assume_timezone` binding > -- > > Key: ARROW-17414 > URL: https://issues.apache.org/jira/browse/ARROW-17414 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > If we run the following code in R, we will get a C++ derived error message > telling us to use {{assume_timezone}}. > However, this error message is not helpful because there is no binding for > the {{assume_timezone}} function in R. > {code:r} > tf <- tempfile() > writeLines("2004-04-01 12:00", tf) > arrow::read_csv_arrow(tf, schema = arrow::schema(col1 = arrow::timestamp("s", > "UTC"))) > #> Error: > #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[s, tz=UTC]: > expected a zone offset in '2004-04-01 12:00'. If these timestamps are in > local time, parse them as timestamps without timezone, then call > assume_timezone. > #> ℹ If you have supplied a schema and your data contains a header row, you > should supply the argument `skip = 1` to prevent the header being read in as > data. > {code} > It would be useful to improve the error message or to allow > {{assume_timezone}} to be used from R as well. > (although {{lubridate::with_tz()}} and {{lubridate::force_tz()}} could be > more useful within a dplyr query) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17425) [R] `lubridate::as_datetime()` in dplyr query should be able to handle time in sub seconds
[ https://issues.apache.org/jira/browse/ARROW-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17425: -- Summary: [R] `lubridate::as_datetime()` in dplyr query should be able to handle time in sub seconds (was: [R] lubridate::as_datetime() etc. in dplyr query should be able to handle time in sub seconds) > [R] `lubridate::as_datetime()` in dplyr query should be able to handle time > in sub seconds > -- > > Key: ARROW-17425 > URL: https://issues.apache.org/jira/browse/ARROW-17425 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Since the current unit is fixed to "s", an error will occur if a time > containing sub-seconds is given. > {code:r} > "1970-01-01T00:00:59.123456789" |> > data.frame(x = _) |> > arrow::arrow_table() |> > dplyr::mutate(x = lubridate::as_datetime(x, tz = "UTC")) |> > dplyr::collect() > #> Error in `dplyr::collect()`: > #> ! Invalid: Failed to parse string: '1970-01-01T00:00:59.123456789' as a > scalar of type timestamp[s] > {code} > I thought that nanoseconds should be used, but it should be noted that > POSIXct is currently supposed to be converted to microseconds, as shown in > ARROW-17424. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17425) [R] lubridate::as_datetime() etc. in dplyr query should be able to handle time in sub seconds
[ https://issues.apache.org/jira/browse/ARROW-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya reassigned ARROW-17425: - Assignee: SHIMA Tatsuya > [R] lubridate::as_datetime() etc. in dplyr query should be able to handle > time in sub seconds > - > > Key: ARROW-17425 > URL: https://issues.apache.org/jira/browse/ARROW-17425 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > > Since the current unit is fixed to "s", an error will occur if a time > containing sub-seconds is given. > {code:r} > "1970-01-01T00:00:59.123456789" |> > data.frame(x = _) |> > arrow::arrow_table() |> > dplyr::mutate(x = lubridate::as_datetime(x, tz = "UTC")) |> > dplyr::collect() > #> Error in `dplyr::collect()`: > #> ! Invalid: Failed to parse string: '1970-01-01T00:00:59.123456789' as a > scalar of type timestamp[s] > {code} > I thought that nanoseconds should be used, but it should be noted that > POSIXct is currently supposed to be converted to microseconds, as shown in > ARROW-17424. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
[ https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579851#comment-17579851 ] SHIMA Tatsuya commented on ARROW-17374: --- I think what you are seeing in the error message is Python 3.1, not Python 3.10. I am not sure, but are you specifying the Python version in yaml or something? In yaml, if you just write {{3.10}}, it will be interpreted as the number {{3.1}}, so you need to quote it as the string {{"3.10"}}. If this is an irrelevant comment, please ignore it. > [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND > -- > > Key: ARROW-17374 > URL: https://issues.apache.org/jira/browse/ARROW-17374 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 8.0.0, 9.0.0, 8.0.1 > Environment: Amazon Linux 2 (RHEL) - 5.10.102-99.473.amzn2.x86_64 >Reporter: Shane Brennan >Priority: Blocker > > I've been trying to install Arrow on an R notebook within AWS SageMaker. > SageMaker provides Jupyter-like notebooks, with each instance running Amazon > Linux 2 as its OS, itself based on RHEL. > Trying to install a few ways, e.g., using the standard binaries, using the > nightly builds, setting ARROW_WITH_SNAPPY to ON and LIBARROW_MINIMAL all > still result in the following error. 
> {noformat} > x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared > -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -Wl,-O2 -Wl,--sort-common > -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags > -Wl,--gc-sections -Wl,--allow-shlib-undefined > -Wl,-rpath,/home/ec2-user/anaconda3/envs/R/lib > -Wl,-rpath-link,/home/ec2-user/anaconda3/envs/R/lib > -L/home/ec2-user/anaconda3/envs/R/lib -o arrow.so RTasks.o altrep.o array.o > array_to_vector.o arraydata.o arrowExports.o bridge.o buffer.o chunkedarray.o > compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o > expression.o extension-impl.o feather.o field.o filesystem.o imports.o io.o > json.o memorypool.o message.o parquet.o r_to_arrow.o recordbatch.o > recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o > schema.o symbols.o table.o threadpool.o type_infer.o > -L/tmp/Rtmpuh87oc/R.INSTALL67114493a3de/arrow/libarrow/arrow-9.0.0.20220809/lib > -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies -lz > SNAPPY_LIB-NOTFOUND /home/ec2-user/anaconda3/envs/R/lib/libbz2.so -pthread > -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto > -lcurl -lssl -lcrypto -lcurl -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -lR > x86_64-conda-linux-gnu-c++: error: SNAPPY_LIB-NOTFOUND: No such file or > directory > make: *** [/home/ec2-user/anaconda3/envs/R/lib/R/share/make/shlib.mk:10: > arrow.so] Error 1{noformat} > Snappy is installed on the systems, and both shared object (.so) and cmake > files are there, where I've tried setting the system env variables Snappy_DIR > and Snappy_LIB to point at them, but to no avail. -- This message was sent by Atlassian Jira (v8.20.10#820010)
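The YAML pitfall mentioned in the comment above can be demonstrated without any CI setup. This small Python sketch uses plain float parsing as a stand-in for a YAML loader, which resolves a bare `3.10` the same way:

```python
# A bare YAML scalar like `python-version: 3.10` is read as a number,
# and numerically 3.10 and 3.1 are the same value - the trailing zero
# is not part of the number. (float() shown here as a stand-in for a
# YAML loader's scalar resolution.)
assert float("3.10") == 3.1
assert str(3.10) == "3.1"  # this is the "3.1" that shows up in logs

# Quoting keeps the version as text, so "3.10" survives intact.
quoted_version = "3.10"
assert quoted_version != str(3.10)
print(quoted_version)  # -> 3.10
```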
[jira] [Updated] (ARROW-17425) [R] lubridate::as_datetime() etc. in dplyr query should be able to handle time in sub seconds
[ https://issues.apache.org/jira/browse/ARROW-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17425: -- Summary: [R] lubridate::as_datetime() etc. in dplyr query should be able to handle time in sub seconds (was: [R] [R] lubridate::as_datetime() etc. in dplyr query should be able to handle time in sub seconds) > [R] lubridate::as_datetime() etc. in dplyr query should be able to handle > time in sub seconds > - > > Key: ARROW-17425 > URL: https://issues.apache.org/jira/browse/ARROW-17425 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > Since the current unit is fixed to "s", an error will occur if a time > containing sub-seconds is given. > {code:r} > "1970-01-01T00:00:59.123456789" |> > data.frame(x = _) |> > arrow::arrow_table() |> > dplyr::mutate(x = lubridate::as_datetime(x, tz = "UTC")) |> > dplyr::collect() > #> Error in `dplyr::collect()`: > #> ! Invalid: Failed to parse string: '1970-01-01T00:00:59.123456789' as a > scalar of type timestamp[s] > {code} > I thought that nanoseconds should be used, but it should be noted that > POSIXct is currently supposed to be converted to microseconds, as shown in > ARROW-17424. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17424) [R] Microsecond is not sufficient units for POSIXct
[ https://issues.apache.org/jira/browse/ARROW-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17424: -- Summary: [R] Microsecond is not sufficient units for POSIXct (was: [R] Microseconds are not sufficient units for POSIXct) > [R] Microsecond is not sufficient units for POSIXct > --- > > Key: ARROW-17424 > URL: https://issues.apache.org/jira/browse/ARROW-17424 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > I believe the {{POSIXct}} type of R currently corresponds to the Arrow > {{timestamp[us, tz=UTC]}} type. > {code:r} > lubridate::as_datetime(0) |> arrow::infer_type() > #> Timestamp > #> timestamp[us, tz=UTC] > {code} > However, as shown below, POSIXct may hold data finer than a microsecond. > {code:r} > lubridate::as_datetime(0.1) |> as.numeric() > #> [1] 1e-09 > lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric() > #> [1] 1.192093e-07 > {code} > I don't know why it is currently set in microseconds, but is there any reason > not to set it in nanoseconds? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17424) [R] Microsecond is not sufficient unit for POSIXct
[ https://issues.apache.org/jira/browse/ARROW-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17424: -- Summary: [R] Microsecond is not sufficient unit for POSIXct (was: [R] Microsecond is not sufficient units for POSIXct) > [R] Microsecond is not sufficient unit for POSIXct > -- > > Key: ARROW-17424 > URL: https://issues.apache.org/jira/browse/ARROW-17424 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > I believe the {{POSIXct}} type of R currently corresponds to the Arrow > {{timestamp[us, tz=UTC]}} type. > {code:r} > lubridate::as_datetime(0) |> arrow::infer_type() > #> Timestamp > #> timestamp[us, tz=UTC] > {code} > However, as shown below, POSIXct may hold data finer than a microsecond. > {code:r} > lubridate::as_datetime(0.1) |> as.numeric() > #> [1] 1e-09 > lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric() > #> [1] 1.192093e-07 > {code} > I don't know why it is currently set in microseconds, but is there any reason > not to set it in nanoseconds? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17425) [R] [R] lubridate::as_datetime() etc. in dplyr query should be able to handle time in sub seconds
SHIMA Tatsuya created ARROW-17425: - Summary: [R] [R] lubridate::as_datetime() etc. in dplyr query should be able to handle time in sub seconds Key: ARROW-17425 URL: https://issues.apache.org/jira/browse/ARROW-17425 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 9.0.0 Reporter: SHIMA Tatsuya Since the current unit is fixed to "s", an error will occur if a time containing sub-seconds is given. {code:r} "1970-01-01T00:00:59.123456789" |> data.frame(x = _) |> arrow::arrow_table() |> dplyr::mutate(x = lubridate::as_datetime(x, tz = "UTC")) |> dplyr::collect() #> Error in `dplyr::collect()`: #> ! Invalid: Failed to parse string: '1970-01-01T00:00:59.123456789' as a scalar of type timestamp[s] {code} I thought that nanoseconds should be used, but it should be noted that POSIXct is currently supposed to be converted to microseconds, as shown in ARROW-17424. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17424) [R] Microseconds are not sufficient units for POSIXct
SHIMA Tatsuya created ARROW-17424: - Summary: [R] Microseconds are not sufficient units for POSIXct Key: ARROW-17424 URL: https://issues.apache.org/jira/browse/ARROW-17424 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 9.0.0 Reporter: SHIMA Tatsuya I believe the {{POSIXct}} type of R currently corresponds to the Arrow {{timestamp[us, tz=UTC]}} type. {code:r} lubridate::as_datetime(0) |> arrow::infer_type() #> Timestamp #> timestamp[us, tz=UTC] {code} However, as shown below, POSIXct may hold data finer than a microsecond. {code:r} lubridate::as_datetime(0.1) |> as.numeric() #> [1] 1e-09 lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric() #> [1] 1.192093e-07 {code} I don't know why it is currently set in microseconds, but is there any reason not to set it in nanoseconds? -- This message was sent by Atlassian Jira (v8.20.10#820010)
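As context for the nanosecond question (an observation about double precision, not from the issue itself): POSIXct stores seconds since the epoch as a 64-bit float, and at present-day epoch values the spacing between adjacent representable doubles is already larger than a nanosecond, so nanosecond detail cannot round-trip through POSIXct regardless of the Arrow unit. A stdlib-only Python sketch:

```python
import math

# POSIXct stores seconds since the epoch as a C double (64-bit float).
# At present-day magnitudes (~1.7e9 seconds) the gap between adjacent
# representable doubles is already larger than a nanosecond.
epoch_now = 1.7e9           # roughly the year 2023, in seconds
ulp = math.ulp(epoch_now)   # spacing to the next representable double
print(ulp)                  # about 2.4e-07 s, i.e. ~238 ns

# One added nanosecond is rounded away entirely:
assert epoch_now + 1e-9 == epoch_now

# One added microsecond is large enough to survive:
assert epoch_now + 1e-6 != epoch_now
```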
[jira] [Created] (ARROW-17417) [R] Implement typeof() in Arrow dplyr queries
SHIMA Tatsuya created ARROW-17417: - Summary: [R] Implement typeof() in Arrow dplyr queries Key: ARROW-17417 URL: https://issues.apache.org/jira/browse/ARROW-17417 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 9.0.0 Reporter: SHIMA Tatsuya Currently this is not useful because it always returns the string {{environment}}; in dbplyr, backends such as duckdb provide the typeof function, so it works there as is. {code:r} dplyr::starwars |> dplyr::transmute(x = typeof(films)) #> # A tibble: 87 × 1 #>x #> #> 1 list #> 2 list #> 3 list #> 4 list #> 5 list #> 6 list #> 7 list #> 8 list #> 9 list #> 10 list #> # … with 77 more rows #> # ℹ Use `print(n = ...)` to see more rows dplyr::starwars |> arrow::to_duckdb() |> dplyr::transmute(x = typeof(films)) |> dplyr::collect() #> # A tibble: 87 × 1 #>x #> #> 1 VARCHAR[] #> 2 VARCHAR[] #> 3 VARCHAR[] #> 4 VARCHAR[] #> 5 VARCHAR[] #> 6 VARCHAR[] #> 7 VARCHAR[] #> 8 VARCHAR[] #> 9 VARCHAR[] #> 10 VARCHAR[] #> # … with 77 more rows #> # ℹ Use `print(n = ...)` to see more rows dplyr::starwars |> arrow::arrow_table() |> dplyr::transmute(x = typeof(films)) |> dplyr::collect() #> # A tibble: 87 × 1 #>x #> #> 1 environment #> 2 environment #> 3 environment #> 4 environment #> 5 environment #> 6 environment #> 7 environment #> 8 environment #> 9 environment #> 10 environment #> # … with 77 more rows #> # ℹ Use `print(n = ...)` to see more rows {code} I would expect it to work as follows. {code:r} dplyr::starwars |> arrow::arrow_table() |> dplyr::transmute(x = arrow::infer_type(films)$ToString()) |> dplyr::collect() #> # A tibble: 87 × 1 #>x #> #> 1 list #> 2 list #> 3 list #> 4 list #> 5 list #> 6 list #> 7 list #> 8 list #> 9 list #> 10 list #> # … with 77 more rows #> # ℹ Use `print(n = ...)` to see more rows {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17415) [R] Implement lubridate::with_tz and lubridate::force_tz
[ https://issues.apache.org/jira/browse/ARROW-17415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya closed ARROW-17415. - Resolution: Duplicate Sorry, it looks like I sent it twice due to a network error. > [R] Implement lubridate::with_tz and lubridate::force_tz > > > Key: ARROW-17415 > URL: https://issues.apache.org/jira/browse/ARROW-17415 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 9.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17415) [R] Implement lubridate::with_tz and lubridate::force_tz
SHIMA Tatsuya created ARROW-17415: - Summary: [R] Implement lubridate::with_tz and lubridate::force_tz Key: ARROW-17415 URL: https://issues.apache.org/jira/browse/ARROW-17415 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 9.0.0 Reporter: SHIMA Tatsuya -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17416) [R] Implement lubridate::with_tz and lubridate::force_tz
SHIMA Tatsuya created ARROW-17416: - Summary: [R] Implement lubridate::with_tz and lubridate::force_tz Key: ARROW-17416 URL: https://issues.apache.org/jira/browse/ARROW-17416 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 9.0.0 Reporter: SHIMA Tatsuya -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17414) [R]: Lack of `assume_timezone` binding
SHIMA Tatsuya created ARROW-17414: - Summary: [R]: Lack of `assume_timezone` binding Key: ARROW-17414 URL: https://issues.apache.org/jira/browse/ARROW-17414 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 9.0.0 Reporter: SHIMA Tatsuya If we run the following code in R, we will get a C++ derived error message telling us to use {{assume_timezone}}. However, this error message is not helpful because there is no binding for the {{assume_timezone}} function in R. {code:r} tf <- tempfile() writeLines("2004-04-01 12:00", tf) arrow::read_csv_arrow(tf, schema = arrow::schema(col1 = arrow::timestamp("s", "UTC"))) #> Error: #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[s, tz=UTC]: expected a zone offset in '2004-04-01 12:00'. If these timestamps are in local time, parse them as timestamps without timezone, then call assume_timezone. #> ℹ If you have supplied a schema and your data contains a header row, you should supply the argument `skip = 1` to prevent the header being read in as data. {code} It would be useful to improve the error message or to allow {{assume_timezone}} to be used from R as well. (although {{lubridate::with_tz()}} and {{lubridate::force_tz()}} could be more useful within a dplyr query) -- This message was sent by Atlassian Jira (v8.20.10#820010)
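For readers unfamiliar with what {{assume_timezone}} does: it attaches a timezone to a naive ("timezone-less") timestamp without shifting the wall-clock value. A stdlib Python analogy (illustration only, not the Arrow API) is datetime.replace(tzinfo=...):

```python
from datetime import datetime, timezone

# assume_timezone semantics: attach a zone to a naive timestamp while
# keeping the wall-clock fields unchanged (no instant shifting).
naive = datetime(2004, 4, 1, 12, 0)           # parsed with no zone offset
assumed = naive.replace(tzinfo=timezone.utc)  # "assume" it was UTC

assert assumed.hour == 12                        # wall clock unchanged
assert assumed.utcoffset().total_seconds() == 0  # now timezone-aware (UTC)
print(assumed.isoformat())  # -> 2004-04-01T12:00:00+00:00
```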
[jira] [Commented] (ARROW-16318) [R]Timezone is not supported by to_duckdb()
[ https://issues.apache.org/jira/browse/ARROW-16318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579647#comment-17579647 ] SHIMA Tatsuya commented on ARROW-16318: --- I believe this is resolved by duckdb 0.4.0. > [R]Timezone is not supported by to_duckdb() > --- > > Key: ARROW-16318 > URL: https://issues.apache.org/jira/browse/ARROW-16318 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 7.0.0 >Reporter: Zsolt Kegyes-Brassai >Priority: Minor > > Here is a reproducible example: > > {code:java} > library(tidyverse) > library(arrow) > df1 <- tibble(time = lubridate::now(tzone = "UTC")) > str(df1) > #> tibble [1 x 1] (S3: tbl_df/tbl/data.frame) > #> $ time: POSIXct[1:1], format: "2022-04-25 12:50:10" > write_dataset(df1, here::here("temp/df1"), format = "parquet") > open_dataset(here::here("temp/df1")) |> > to_duckdb() > #> Error: duckdb_prepare_R: Failed to prepare query SELECT * > #> FROM "arrow_001" AS "q01" > #> WHERE (0 = 1) > #> Error: Not implemented Error: Unsupported Internal Arrow Type tsu:UTC > df2 <- tibble(time = lubridate::now()) > str(df2) > #> tibble [1 x 1] (S3: tbl_df/tbl/data.frame) > #> $ time: POSIXct[1:1], format: "2022-04-25 14:50:11" > write_dataset(df2, here::here("temp/df2"), format = "parquet") > open_dataset(here::here("temp/df2")) |> > to_duckdb() > #> # Source: table [?? x 1] > #> # Database: duckdb_connection > #> time > #> > #> 1 2022-04-25 12:50:11 > {code} > > The timestamps without timezone information are working fine. > How one can remove easily the timezone information from {{timestamp }}type > column from a parquet dataset? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15602) [R][Docs] Update docs to explain how to read timestamp with timezone columns
[ https://issues.apache.org/jira/browse/ARROW-15602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-15602: -- Summary: [R][Docs] Update docs to explain how to read timestamp with timezone columns (was: [R] Update docs to explain how to read timestamp with timezone columns) > [R][Docs] Update docs to explain how to read timestamp with timezone columns > > > Key: ARROW-15602 > URL: https://issues.apache.org/jira/browse/ARROW-15602 > Project: Apache Arrow > Issue Type: Bug > Components: R > Environment: R version 4.1.2 (2021-11-01) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 20.04.3 LTS >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The following values in a csv file can be read as timestamp by > `pyarrow.csv.read_csv` and `readr::read_csv`, but not by > `arrow::read_csv_arrow`. > {code} > "x" > "2004-04-01T12:00+09:00" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-15602) [R] Update docs to explain how to read timestamp with timezone columns
[ https://issues.apache.org/jira/browse/ARROW-15602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya reassigned ARROW-15602: - Assignee: SHIMA Tatsuya > [R] Update docs to explain how to read timestamp with timezone columns > -- > > Key: ARROW-15602 > URL: https://issues.apache.org/jira/browse/ARROW-15602 > Project: Apache Arrow > Issue Type: Bug > Components: R > Environment: R version 4.1.2 (2021-11-01) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 20.04.3 LTS >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > > The following values in a csv file can be read as timestamp by > `pyarrow.csv.read_csv` and `readr::read_csv`, but not by > `arrow::read_csv_arrow`. > {code} > "x" > "2004-04-01T12:00+09:00" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15602) [R] Update docs to explain how to read timestamp with timezone columns
[ https://issues.apache.org/jira/browse/ARROW-15602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-15602: -- Summary: [R] Update docs to explain how to read timestamp with timezone columns (was: [R] can't read timestamp with timezone from CSV (or other delimited) file without options) > [R] Update docs to explain how to read timestamp with timezone columns > -- > > Key: ARROW-15602 > URL: https://issues.apache.org/jira/browse/ARROW-15602 > Project: Apache Arrow > Issue Type: Bug > Components: R > Environment: R version 4.1.2 (2021-11-01) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 20.04.3 LTS >Reporter: SHIMA Tatsuya >Priority: Major > > The following values in a csv file can be read as timestamp by > `pyarrow.csv.read_csv` and `readr::read_csv`, but not by > `arrow::read_csv_arrow`. > {code} > "x" > "2004-04-01T12:00+09:00" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15602) [R] can't read timestamp with timezone from CSV (or other delimited) file without options
[ https://issues.apache.org/jira/browse/ARROW-15602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579539#comment-17579539 ] SHIMA Tatsuya commented on ARROW-15602: --- I may have misunderstood something. So far the following example seems to work. An update to the documentation may be sufficient. {code:r} tf <- tempfile() writeLines("x\n2004-04-01T12:00+09:00", tf) arrow::read_csv_arrow(tf) #> # A tibble: 1 × 1 #> x #> #> 1 2004-04-01 03:00:00 {code} > [R] can't read timestamp with timezone from CSV (or other delimited) file > without options > - > > Key: ARROW-15602 > URL: https://issues.apache.org/jira/browse/ARROW-15602 > Project: Apache Arrow > Issue Type: Bug > Components: R > Environment: R version 4.1.2 (2021-11-01) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 20.04.3 LTS >Reporter: SHIMA Tatsuya >Priority: Major > > The following values in a csv file can be read as timestamp by > `pyarrow.csv.read_csv` and `readr::read_csv`, but not by > `arrow::read_csv_arrow`. > {code} > "x" > "2004-04-01T12:00+09:00" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15602) [R] can't read timestamp with timezone from CSV (or other delimited) file without options
[ https://issues.apache.org/jira/browse/ARROW-15602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-15602: -- Summary: [R] can't read timestamp with timezone from CSV (or other delimited) file without options (was: [R] Update docs to explain how to specify timezone in CSV parsing) > [R] can't read timestamp with timezone from CSV (or other delimited) file > without options > - > > Key: ARROW-15602 > URL: https://issues.apache.org/jira/browse/ARROW-15602 > Project: Apache Arrow > Issue Type: Bug > Components: R > Environment: R version 4.1.2 (2021-11-01) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 20.04.3 LTS >Reporter: SHIMA Tatsuya >Priority: Major > > The following values in a csv file can be read as timestamp by > `pyarrow.csv.read_csv` and `readr::read_csv`, but not by > `arrow::read_csv_arrow`. > {code} > "x" > "2004-04-01T12:00+09:00" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17092) [Docs] Add note about "Feather" to the IPC file format document
[ https://issues.apache.org/jira/browse/ARROW-17092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya reassigned ARROW-17092: - Assignee: SHIMA Tatsuya > [Docs] Add note about "Feather" to the IPC file format document > --- > > Key: ARROW-17092 > URL: https://issues.apache.org/jira/browse/ARROW-17092 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Affects Versions: 8.0.0 >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > The IPC file format is often referred to as Feather (especially in relation > to Python and R), but beginners are confused because the word "Feather" does > not appear on the IPC file format documentation. > https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format > Note: This ticket was created as a result of a conversation with [~kou] on > Twitter. > https://twitter.com/eitsupi/status/1547534742324920321 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17089) [Python] Use `.arrow` as extension for IPC file dataset
[ https://issues.apache.org/jira/browse/ARROW-17089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya reassigned ARROW-17089: - Assignee: SHIMA Tatsuya > [Python] Use `.arrow` as extension for IPC file dataset > --- > > Key: ARROW-17089 > URL: https://issues.apache.org/jira/browse/ARROW-17089 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 8.0.0 >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > > Same as ARROW-17088 > As noted in the following document, the recommended extension for IPC files > is now `.arrow`. > > We recommend the “.arrow” extension for files created with this format. > https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format > However, currently when writing a dataset with the > {{pyarrow.dataset.write_dataset}} function, the default extension is > {{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected. > https://github.com/apache/arrow/blob/b8067151db9bfc04860285fdd8b5e73703346037/python/pyarrow/_dataset.pyx#L1149-L1151 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17143) [R] Add examples working with `tidyr::unnest`and `tidyr::unnest_longer`
[ https://issues.apache.org/jira/browse/ARROW-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17143: -- Description: Related to ARROW-8813 ARROW-12099 The arrow package can convert json files to data frames very easily, but {{tidyr::unnest_longer}} is needed for array expansion. Wonder if {{tidyr}} could be added to the recommended package and examples like this could be included in the documentation and test cases. {code:r} tf <- tempfile() on.exit(unlink(tf)) writeLines(' { "hello": 3.5, "world": false, "foo": { "bar": [ 1, 2 ] } } { "hello": 3.25, "world": null } { "hello": 0.0, "world": true, "foo": { "bar": [ 3, 4, 5 ] } } ', tf) arrow::read_json_arrow(tf) |> tidyr::unnest(foo, names_sep = ".") |> tidyr::unnest_longer(foo.bar) #> # A tibble: 6 × 3 #> hello world foo.bar #> #> 1 3.5 FALSE 1 #> 2 3.5 FALSE 2 #> 3 3.25 NA NA #> 4 0TRUE3 #> 5 0TRUE4 #> 6 0TRUE5 {code} was: Related to ARROW-8813 The arrow package can convert json files to data frames very easily, but {{tidyr::unnest_longer}} is needed for array expansion. Wonder if {{tidyr}} could be added to the recommended package and examples like this could be included in the documentation and test cases. 
{code:r} tf <- tempfile() on.exit(unlink(tf)) writeLines(' { "hello": 3.5, "world": false, "foo": { "bar": [ 1, 2 ] } } { "hello": 3.25, "world": null } { "hello": 0.0, "world": true, "foo": { "bar": [ 3, 4, 5 ] } } ', tf) arrow::read_json_arrow(tf) |> tidyr::unnest(foo, names_sep = ".") |> tidyr::unnest_longer(foo.bar) #> # A tibble: 6 × 3 #> hello world foo.bar #> #> 1 3.5 FALSE 1 #> 2 3.5 FALSE 2 #> 3 3.25 NA NA #> 4 0TRUE3 #> 5 0TRUE4 #> 6 0TRUE5 {code} > [R] Add examples working with `tidyr::unnest`and `tidyr::unnest_longer` > --- > > Key: ARROW-17143 > URL: https://issues.apache.org/jira/browse/ARROW-17143 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, R >Affects Versions: 8.0.1 >Reporter: SHIMA Tatsuya >Priority: Major > > Related to ARROW-8813 ARROW-12099 > The arrow package can convert json files to data frames very easily, but > {{tidyr::unnest_longer}} is needed for array expansion. > Wonder if {{tidyr}} could be added to the recommended package and examples > like this could be included in the documentation and test cases. > {code:r} > tf <- tempfile() > on.exit(unlink(tf)) > writeLines(' > { "hello": 3.5, "world": false, "foo": { "bar": [ 1, 2 ] } } > { "hello": 3.25, "world": null } > { "hello": 0.0, "world": true, "foo": { "bar": [ 3, 4, 5 ] } } > ', tf) > arrow::read_json_arrow(tf) |> > tidyr::unnest(foo, names_sep = ".") |> > tidyr::unnest_longer(foo.bar) > #> # A tibble: 6 × 3 > #> hello world foo.bar > #> > #> 1 3.5 FALSE 1 > #> 2 3.5 FALSE 2 > #> 3 3.25 NA NA > #> 4 0TRUE3 > #> 5 0TRUE4 > #> 6 0TRUE5 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17143) [R] Add examples working with `tidyr::unnest`and `tidyr::unnest_longer`
SHIMA Tatsuya created ARROW-17143: - Summary: [R] Add examples working with `tidyr::unnest`and `tidyr::unnest_longer` Key: ARROW-17143 URL: https://issues.apache.org/jira/browse/ARROW-17143 Project: Apache Arrow Issue Type: Improvement Components: Documentation, R Affects Versions: 8.0.1 Reporter: SHIMA Tatsuya Related to ARROW-8813 The arrow package can convert json files to data frames very easily, but {{tidyr::unnest_longer}} is needed for array expansion. Wonder if {{tidyr}} could be added to the recommended package and examples like this could be included in the documentation and test cases. {code:r} tf <- tempfile() on.exit(unlink(tf)) writeLines(' { "hello": 3.5, "world": false, "foo": { "bar": [ 1, 2 ] } } { "hello": 3.25, "world": null } { "hello": 0.0, "world": true, "foo": { "bar": [ 3, 4, 5 ] } } ', tf) arrow::read_json_arrow(tf) |> tidyr::unnest(foo, names_sep = ".") |> tidyr::unnest_longer(foo.bar) #> # A tibble: 6 × 3 #> hello world foo.bar #> #> 1 3.5 FALSE 1 #> 2 3.5 FALSE 2 #> 3 3.25 NA NA #> 4 0TRUE3 #> 5 0TRUE4 #> 6 0TRUE5 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-8324) [R] Add read/write_ipc_file separate from _feather
[ https://issues.apache.org/jira/browse/ARROW-8324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya reassigned ARROW-8324: Assignee: SHIMA Tatsuya > [R] Add read/write_ipc_file separate from _feather > -- > > Key: ARROW-8324 > URL: https://issues.apache.org/jira/browse/ARROW-8324 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: SHIMA Tatsuya >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > See [https://github.com/apache/arrow/pull/6771#issuecomment-608133760] > {quote}Let's add read/write_ipc_file also? I'm wary of the "version" option > in "write_feather" and the Feather version inference capability in > "read_feather". It's potentially confusing and we may choose to add options > to write_ipc_file/read_ipc_file that are more developer centric, having to do > with particulars in the IPC format, that are not relevant or appropriate for > the Feather APIs. > IMHO it's best for "Feather format" to remain an abstracted higher-level > concept with its use of the "IPC file format" as an implementation detail, > and segregated from the other things. > {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-8324) [R] Add read/write_ipc_file separate from _feather
[ https://issues.apache.org/jira/browse/ARROW-8324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567451#comment-17567451 ] SHIMA Tatsuya commented on ARROW-8324: -- Yes, I looked at this. Since the C++ library manages the Feather format version now, it seems easier to make {{write_ipc_file()}} a special case of {{write_feather()}} and {{read_ipc_file()}} simply an alias for {{read_feather()}} now. In the future, these functions could be updated as the C++ library is updated. > [R] Add read/write_ipc_file separate from _feather > -- > > Key: ARROW-8324 > URL: https://issues.apache.org/jira/browse/ARROW-8324 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > See [https://github.com/apache/arrow/pull/6771#issuecomment-608133760] > {quote}Let's add read/write_ipc_file also? I'm wary of the "version" option > in "write_feather" and the Feather version inference capability in > "read_feather". It's potentially confusing and we may choose to add options > to write_ipc_file/read_ipc_file that are more developer centric, having to do > with particulars in the IPC format, that are not relevant or appropriate for > the Feather APIs. > IMHO it's best for "Feather format" to remain an abstracted higher-level > concept with its use of the "IPC file format" as an implementation detail, > and segregated from the other things. > {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17089) [Python] Use `.arrow` as extension for IPC file dataset
[ https://issues.apache.org/jira/browse/ARROW-17089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17089: -- Description: Same as ARROW-17088 As noted in the following document, the recommended extension for IPC files is now `.arrow`. > We recommend the “.arrow” extension for files created with this format. https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format However, currently when writing a dataset with the {{pyarrow.dataset.write_dataset}} function, the default extension is {{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected. https://github.com/apache/arrow/blob/b8067151db9bfc04860285fdd8b5e73703346037/python/pyarrow/_dataset.pyx#L1149-L1151 was: Same as ARROW-17088 As noted in the following document, the recommended extension for IPC files is now `.arrow`. > We recommend the “.arrow” extension for files created with this format. https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format However, currently when writing a dataset with the {{pyarrow.dataset.write_dataset}} function, the default extension is {{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected. https://github.com/apache/arrow/blob/b8067151db9bfc04860285fdd8b5e73703346037/python/pyarrow/_dataset.pyx#L1149-L1151 > [Python] Use `.arrow` as extension for IPC file dataset > --- > > Key: ARROW-17089 > URL: https://issues.apache.org/jira/browse/ARROW-17089 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 8.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > Same as ARROW-17088 > As noted in the following document, the recommended extension for IPC files > is now `.arrow`. > > We recommend the “.arrow” extension for files created with this format. 
> https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format > However, currently when writing a dataset with the > {{pyarrow.dataset.write_dataset}} function, the default extension is > {{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected. > https://github.com/apache/arrow/blob/b8067151db9bfc04860285fdd8b5e73703346037/python/pyarrow/_dataset.pyx#L1149-L1151 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17089) [Python] Use `.arrow` as extension for IPC file dataset
[ https://issues.apache.org/jira/browse/ARROW-17089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17089: -- Description: Same as ARROW-17088 As noted in the following document, the recommended extension for IPC files is now `.arrow`. > We recommend the “.arrow” extension for files created with this format. https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format However, currently when writing a dataset with the {{pyarrow.dataset.write_dataset}} function, the default extension is {{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected. https://github.com/apache/arrow/blob/b8067151db9bfc04860285fdd8b5e73703346037/python/pyarrow/_dataset.pyx#L1149-L1151 was: Same as ARROW-17088 As noted in the following document, the recommended extension for IPC files is now `.arrow`. > We recommend the “.arrow” extension for files created with this format. https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format However, currently when writing a dataset with the {{pyarrow.dataset.write_dataset}} function, the default extension is {{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected. > [Python] Use `.arrow` as extension for IPC file dataset > --- > > Key: ARROW-17089 > URL: https://issues.apache.org/jira/browse/ARROW-17089 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 8.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > Same as ARROW-17088 > As noted in the following document, the recommended extension for IPC files > is now `.arrow`. > > We recommend the “.arrow” extension for files created with this format. > https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format > However, currently when writing a dataset with the > {{pyarrow.dataset.write_dataset}} function, the default extension is > {{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected. 
> https://github.com/apache/arrow/blob/b8067151db9bfc04860285fdd8b5e73703346037/python/pyarrow/_dataset.pyx#L1149-L1151 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17092) [Docs] Add note about "Feather" to the IPC file format document
SHIMA Tatsuya created ARROW-17092: - Summary: [Docs] Add note about "Feather" to the IPC file format document Key: ARROW-17092 URL: https://issues.apache.org/jira/browse/ARROW-17092 Project: Apache Arrow Issue Type: Improvement Components: Documentation Affects Versions: 8.0.0 Reporter: SHIMA Tatsuya The IPC file format is often referred to as Feather (especially in relation to Python and R), but beginners are confused because the word "Feather" does not appear on the IPC file format documentation. https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format Note: This ticket was created as a result of a conversation with [~kou] on Twitter. https://twitter.com/eitsupi/status/1547534742324920321 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17089) [Python] Use `.arrow` as extension for IPC file dataset
SHIMA Tatsuya created ARROW-17089: - Summary: [Python] Use `.arrow` as extension for IPC file dataset Key: ARROW-17089 URL: https://issues.apache.org/jira/browse/ARROW-17089 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 8.0.0 Reporter: SHIMA Tatsuya Same as ARROW-17088 As noted in the following document, the recommended extension for IPC files is now `.arrow`. > We recommend the “.arrow” extension for files created with this format. https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format However, currently when writing a dataset with the {{pyarrow.dataset.write_dataset}} function, the default extension is {{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17088) [R] Use `.arrow` as extension of IPC files of datasets
SHIMA Tatsuya created ARROW-17088: - Summary: [R] Use `.arrow` as extension of IPC files of datasets Key: ARROW-17088 URL: https://issues.apache.org/jira/browse/ARROW-17088 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 8.0.0 Reporter: SHIMA Tatsuya Related to ARROW-17072 As noted in the following document, the recommended extension for IPC files is now `.arrow`. > We recommend the “.arrow” extension for files created with this format. https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format However, currently when writing a dataset with the {{write_dataset}} function, the default extension is {{.feather}} when {{feather}} is selected as the format, and {{.ipc}} when {{ipc}} is selected. https://github.com/apache/arrow/blob/f295da4cfdcf102d9ac2d16bbca6f8342fc3e6a8/r/R/dataset-write.R#L124-L126 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17085) [R] group_vars() should not return NULL
[ https://issues.apache.org/jira/browse/ARROW-17085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya reassigned ARROW-17085: - Assignee: SHIMA Tatsuya > [R] group_vars() should not return NULL > --- > > Key: ARROW-17085 > URL: https://issues.apache.org/jira/browse/ARROW-17085 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 8.0.0 >Reporter: SHIMA Tatsuya >Assignee: SHIMA Tatsuya >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > {code:r} > mtcars |> dplyr::group_vars() > #> character(0) > mtcars |> arrow:::as_adq() |> dplyr::group_vars() > #> character(0) > mtcars |> arrow::arrow_table() |> dplyr::group_vars() > #> NULL > {code} > {{dplyr::group_vars()}} does not return NULL, so the following > code will result in an error. > {code:r} > mtcars |> arrow::arrow_table() |> dtplyr::lazy_dt() > #> Error in new_step(parent, vars = names(parent), groups = groups, locals = > list(), : is.character(groups) is not TRUE > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17085) [R] group_vars() should not return NULL
[ https://issues.apache.org/jira/browse/ARROW-17085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-17085: -- Summary: [R] group_vars() should not return NULL (was: [R] group_vars() returns NULL) > [R] group_vars() should not return NULL > --- > > Key: ARROW-17085 > URL: https://issues.apache.org/jira/browse/ARROW-17085 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 8.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > {code:r} > mtcars |> dplyr::group_vars() > #> character(0) > mtcars |> arrow:::as_adq() |> dplyr::group_vars() > #> character(0) > mtcars |> arrow::arrow_table() |> dplyr::group_vars() > #> NULL > {code} > {{dplyr::group_vars()}} does not return NULL, so the following > code will result in an error. > {code:r} > mtcars |> arrow::arrow_table() |> dtplyr::lazy_dt() > #> Error in new_step(parent, vars = names(parent), groups = groups, locals = > list(), : is.character(groups) is not TRUE > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17085) [R] group_vars() returns NULL
[ https://issues.apache.org/jira/browse/ARROW-17085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567235#comment-17567235 ] SHIMA Tatsuya commented on ARROW-17085: --- > You may also want to change groups() to return an empty list() instead of > NULL. I was not aware of this. I will take a look at this too. > [R] group_vars() returns NULL > --- > > Key: ARROW-17085 > URL: https://issues.apache.org/jira/browse/ARROW-17085 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 8.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > {code:r} > mtcars |> dplyr::group_vars() > #> character(0) > mtcars |> arrow:::as_adq() |> dplyr::group_vars() > #> character(0) > mtcars |> arrow::arrow_table() |> dplyr::group_vars() > #> NULL > {code} > {{dplyr::group_vars()}} does not return NULL, so the following > code will result in an error. > {code:r} > mtcars |> arrow::arrow_table() |> dtplyr::lazy_dt() > #> Error in new_step(parent, vars = names(parent), groups = groups, locals = > list(), : is.character(groups) is not TRUE > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17085) [R] group_vars() returns NULL
[ https://issues.apache.org/jira/browse/ARROW-17085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567233#comment-17567233 ] SHIMA Tatsuya commented on ARROW-17085: --- Yes, I was trying that work. Will send PR after this. > [R] group_vars() returns NULL > --- > > Key: ARROW-17085 > URL: https://issues.apache.org/jira/browse/ARROW-17085 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 8.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > {code:r} > mtcars |> dplyr::group_vars() > #> character(0) > mtcars |> arrow:::as_adq() |> dplyr::group_vars() > #> character(0) > mtcars |> arrow::arrow_table() |> dplyr::group_vars() > #> NULL > {code} > {{dplyr::group_vars()}} does not return NULL, so the following > code will result in an error. > {code:r} > mtcars |> arrow::arrow_table() |> dtplyr::lazy_dt() > #> Error in new_step(parent, vars = names(parent), groups = groups, locals = > list(), : is.character(groups) is not TRUE > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17085) [R] group_vars() returns NULL
SHIMA Tatsuya created ARROW-17085: - Summary: [R] group_vars() returns NULL Key: ARROW-17085 URL: https://issues.apache.org/jira/browse/ARROW-17085 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 8.0.0 Reporter: SHIMA Tatsuya {code:r} mtcars |> dplyr::group_vars() #> character(0) mtcars |> arrow:::as_adq() |> dplyr::group_vars() #> character(0) mtcars |> arrow::arrow_table() |> dplyr::group_vars() #> NULL {code} {{dplyr::group_vars()}} does not return NULL, so the following code will result in an error. {code:r} mtcars |> arrow::arrow_table() |> dtplyr::lazy_dt() #> Error in new_step(parent, vars = names(parent), groups = groups, locals = list(), : is.character(groups) is not TRUE {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17072) [R] Rename *_feather functions
[ https://issues.apache.org/jira/browse/ARROW-17072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567175#comment-17567175 ] SHIMA Tatsuya commented on ARROW-17072: --- Thank you both and I agree to close this in favor of ARROW-8324. Perhaps after adding *_ipc_file functions, the various documents need to be updated to recommend the use of _ipc_file functions. > [R] Rename *_feather functions > -- > > Key: ARROW-17072 > URL: https://issues.apache.org/jira/browse/ARROW-17072 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 8.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > As noted in the following document, the recommended extension for IPC files > is now `.arrow`. > > We recommend the “.arrow” extension for files created with this format. > https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format > However, the R library cannot read IPC files without using the `read_feather` > function after ARROW-16268. > I think users will be confused if you keep using this function name because > the word `feather' is not associated with `.arrow` for beginners. > For example, could we deprecate function `read_feather` and recommend another > function like `read_ipc_file`, which has the same functionality? > Note: This ticket was created as a result of a conversation with [~kou] on > Twitter. > https://twitter.com/ktou/status/1547373388687376386 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17072) [R] Rename *_feather functions
[ https://issues.apache.org/jira/browse/ARROW-17072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567175#comment-17567175 ] SHIMA Tatsuya edited comment on ARROW-17072 at 7/15/22 10:05 AM: - Thank you both and I agree to close this in favor of ARROW-8324. Perhaps after adding *_ipc_file functions, the various documents need to be updated to recommend the use of *_ipc_file functions. was (Author: JIRAUSER280211): Thank you both and I agree to close this in favor of ARROW-8324. Perhaps after adding *_ipc_file functions, the various documents need to be updated to recommend the use of _ipc_file functions. > [R] Rename *_feather functions > -- > > Key: ARROW-17072 > URL: https://issues.apache.org/jira/browse/ARROW-17072 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 8.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > As noted in the following document, the recommended extension for IPC files > is now `.arrow`. > > We recommend the “.arrow” extension for files created with this format. > https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format > However, the R library cannot read IPC files without using the `read_feather` > function after ARROW-16268. > I think users will be confused if you keep using this function name because > the word `feather' is not associated with `.arrow` for beginners. > For example, could we deprecate function `read_feather` and recommend another > function like `read_ipc_file`, which has the same functionality? > Note: This ticket was created as a result of a conversation with [~kou] on > Twitter. > https://twitter.com/ktou/status/1547373388687376386 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17072) [R] Rename *_feather functions
SHIMA Tatsuya created ARROW-17072: - Summary: [R] Rename *_feather functions Key: ARROW-17072 URL: https://issues.apache.org/jira/browse/ARROW-17072 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 8.0.0 Reporter: SHIMA Tatsuya As noted in the following document, the recommended extension for IPC files is now `.arrow`. > We recommend the “.arrow” extension for files created with this format. https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format However, the R library cannot read IPC files without using the `read_feather` function after ARROW-16268. I think users will be confused if you keep using this function name because the word `feather' is not associated with `.arrow` for beginners. For example, could we deprecate function `read_feather` and recommend another function like `read_ipc_file`, which has the same functionality? Note: This ticket was created as a result of a conversation with [~kou] on Twitter. https://twitter.com/ktou/status/1547373388687376386 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-15816) [R][Docs] pkgdown config refactoring
[ https://issues.apache.org/jira/browse/ARROW-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya closed ARROW-15816. - Resolution: Won't Fix > [R][Docs] pkgdown config refactoring > > > Key: ARROW-15816 > URL: https://issues.apache.org/jira/browse/ARROW-15816 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: SHIMA Tatsuya >Priority: Major > Labels: pull-request-available > Time Spent: 4h > Remaining Estimate: 0h > > Part of ARROW-15734 > Need to change the configuration of the pkgdown site which is not compatible > with bootstrap5. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16038) [R] different behavior from dplyr when mutate's `.keep` option is set
SHIMA Tatsuya created ARROW-16038: - Summary: [R] different behavior from dplyr when mutate's `.keep` option is set Key: ARROW-16038 URL: https://issues.apache.org/jira/browse/ARROW-16038 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 7.0.0 Reporter: SHIMA Tatsuya The order of columns when `dplyr::mutate`'s `.keep` option is set to "none", etc. has been changed in dplyr 1.0.8 and differs from the current behavior of the arrow package. For more information, please see the following issues. https://github.com/tidyverse/dplyr/pull/6035 https://github.com/tidyverse/dplyr/issues/6086 https://github.com/tidyverse/dplyr/pull/6087 {code:r} library(dplyr, warn.conflicts = FALSE) df <- tibble::tibble(x = 1:3, y = 4:6) df |> transmute(x, z = x + 1, y) #> # A tibble: 3 × 3 #> x z y #> <int> <dbl> <int> #> 1 1 2 4 #> 2 2 3 5 #> 3 3 4 6 df |> mutate(x, z = x + 1, y, .keep = "none") #> # A tibble: 3 × 3 #> x y z #> <int> <int> <dbl> #> 1 1 4 2 #> 2 2 5 3 #> 3 3 6 4 df |> arrow::arrow_table() |> mutate(x, z = x + 1, y, .keep = "none") |> collect() #> # A tibble: 3 × 3 #> x z y #> <int> <dbl> <int> #> 1 1 2 4 #> 2 2 3 5 #> 3 3 4 6 {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15602) [R] can't read timestamp with timezone from CSV (or other delimited) file
[ https://issues.apache.org/jira/browse/ARROW-15602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502846#comment-17502846 ] SHIMA Tatsuya commented on ARROW-15602: --- I believe this is the format defined in ISO8601. Don't you think libarrow's ISO8601 parser can handle this? In fact, it seems to be handled by pyarrow. (Please see my comments above.) > [R] can't read timestamp with timezone from CSV (or other delimited) file > - > > Key: ARROW-15602 > URL: https://issues.apache.org/jira/browse/ARROW-15602 > Project: Apache Arrow > Issue Type: Bug > Components: R > Environment: R version 4.1.2 (2021-11-01) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 20.04.3 LTS >Reporter: SHIMA Tatsuya >Priority: Major > > The following values in a csv file can be read as timestamp by > `pyarrow.csv.read_csv` and `readr::read_csv`, but not by > `arrow::read_csv_arrow`. > {code} > "x" > "2004-04-01T12:00+09:00" > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15828) [Python][R] ChunkedArray's cast() method combine multiple arrays into one
[ https://issues.apache.org/jira/browse/ARROW-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502168#comment-17502168 ] SHIMA Tatsuya commented on ARROW-15828: --- Thanks for the reply. My question was whether it is normal for the chunks to behave differently depending on the cast destination, whether they remain split into chunks or become one continuous chunk. Does your guess explain that the chunks are only contiguous if the type of the cast destination is numeric? I simply found this behavior while playing around and would appreciate it if you could close this if this is the intended behavior. > [Python][R] ChunkedArray's cast() method combine multiple arrays into one > - > > Key: ARROW-15828 > URL: https://issues.apache.org/jira/browse/ARROW-15828 > Project: Apache Arrow > Issue Type: Bug > Components: Python, R >Affects Versions: 7.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > It appears that if I try to cast to int or float, the array will be one. > {code:r} > library(arrow, warn.conflicts = FALSE) > #> See arrow_info() for available features > chunked_array(1:2, 3:4, 5:6)$cast(string()) > #> ChunkedArray > #> [ > #> [ > #> "1", > #> "2" > #> ], > #> [ > #> "3", > #> "4" > #> ], > #> [ > #> "5", > #> "6" > #> ] > #> ] > chunked_array(1:2, 3:4, 5:6)$cast(float64()) > #> ChunkedArray > #> [ > #> [ > #> 1, > #> 2, > #> 3, > #> 4, > #> 5, > #> 6 > #> ] > #> ] > chunked_array(1:2, 3:4, 5:6)$cast(int64()) > #> ChunkedArray > #> [ > #> [ > #> 1, > #> 2, > #> 3, > #> 4, > #> 5, > #> 6 > #> ] > #> ] > chunked_array(1:2, 3:4, 5:6)$cast(date32()) > #> ChunkedArray > #> [ > #> [ > #> 1970-01-02, > #> 1970-01-03 > #> ], > #> [ > #> 1970-01-04, > #> 1970-01-05 > #> ], > #> [ > #> 1970-01-06, > #> 1970-01-07 > #> ] > #> ] > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15828) [Python][R] ChunkedArray's cast() method combine multiple arrays into one
[ https://issues.apache.org/jira/browse/ARROW-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-15828: -- Summary: [Python][R] ChunkedArray's cast() method combine multiple arrays into one (was: [R] ChunkedArray$cast() combine multiple arrays into one) > [Python][R] ChunkedArray's cast() method combine multiple arrays into one > - > > Key: ARROW-15828 > URL: https://issues.apache.org/jira/browse/ARROW-15828 > Project: Apache Arrow > Issue Type: Bug > Components: Python, R >Affects Versions: 7.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > It appears that if I try to cast to int or float, the array will be one. > {code:r} > library(arrow, warn.conflicts = FALSE) > #> See arrow_info() for available features > chunked_array(1:2, 3:4, 5:6)$cast(string()) > #> ChunkedArray > #> [ > #> [ > #> "1", > #> "2" > #> ], > #> [ > #> "3", > #> "4" > #> ], > #> [ > #> "5", > #> "6" > #> ] > #> ] > chunked_array(1:2, 3:4, 5:6)$cast(float64()) > #> ChunkedArray > #> [ > #> [ > #> 1, > #> 2, > #> 3, > #> 4, > #> 5, > #> 6 > #> ] > #> ] > chunked_array(1:2, 3:4, 5:6)$cast(int64()) > #> ChunkedArray > #> [ > #> [ > #> 1, > #> 2, > #> 3, > #> 4, > #> 5, > #> 6 > #> ] > #> ] > chunked_array(1:2, 3:4, 5:6)$cast(date32()) > #> ChunkedArray > #> [ > #> [ > #> 1970-01-02, > #> 1970-01-03 > #> ], > #> [ > #> 1970-01-04, > #> 1970-01-05 > #> ], > #> [ > #> 1970-01-06, > #> 1970-01-07 > #> ] > #> ] > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15828) [R] ChunkedArray$cast() combine multiple arrays into one
[ https://issues.apache.org/jira/browse/ARROW-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-15828: -- Component/s: Python > [R] ChunkedArray$cast() combine multiple arrays into one > > > Key: ARROW-15828 > URL: https://issues.apache.org/jira/browse/ARROW-15828 > Project: Apache Arrow > Issue Type: Bug > Components: Python, R >Affects Versions: 7.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > It appears that if I try to cast to int or float, the array will be one. > {code:r} > library(arrow, warn.conflicts = FALSE) > #> See arrow_info() for available features > chunked_array(1:2, 3:4, 5:6)$cast(string()) > #> ChunkedArray > #> [ > #> [ > #> "1", > #> "2" > #> ], > #> [ > #> "3", > #> "4" > #> ], > #> [ > #> "5", > #> "6" > #> ] > #> ] > chunked_array(1:2, 3:4, 5:6)$cast(float64()) > #> ChunkedArray > #> [ > #> [ > #> 1, > #> 2, > #> 3, > #> 4, > #> 5, > #> 6 > #> ] > #> ] > chunked_array(1:2, 3:4, 5:6)$cast(int64()) > #> ChunkedArray > #> [ > #> [ > #> 1, > #> 2, > #> 3, > #> 4, > #> 5, > #> 6 > #> ] > #> ] > chunked_array(1:2, 3:4, 5:6)$cast(date32()) > #> ChunkedArray > #> [ > #> [ > #> 1970-01-02, > #> 1970-01-03 > #> ], > #> [ > #> 1970-01-04, > #> 1970-01-05 > #> ], > #> [ > #> 1970-01-06, > #> 1970-01-07 > #> ] > #> ] > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15828) [R] ChunkedArray$cast() combine multiple arrays into one
[ https://issues.apache.org/jira/browse/ARROW-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500850#comment-17500850 ] SHIMA Tatsuya commented on ARROW-15828: --- I noticed that this is reproduced in Python as well. Is this the intended behavior? {code:python} >>> import pyarrow as pa >>> pa.chunked_array([pa.array([1,2]),pa.array([3,4])]).cast(pa.float64()) [ [ 1, 2, 3, 4 ] ] >>> pa.chunked_array([pa.array([1,2]),pa.array([3,4])]).cast(pa.utf8()) [ [ "1", "2" ], [ "3", "4" ] ] {code} > [R] ChunkedArray$cast() combine multiple arrays into one > > > Key: ARROW-15828 > URL: https://issues.apache.org/jira/browse/ARROW-15828 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 7.0.0 >Reporter: SHIMA Tatsuya >Priority: Major > > It appears that if I try to cast to int or float, the array will be one. > {code:r} > library(arrow, warn.conflicts = FALSE) > #> See arrow_info() for available features > chunked_array(1:2, 3:4, 5:6)$cast(string()) > #> ChunkedArray > #> [ > #> [ > #> "1", > #> "2" > #> ], > #> [ > #> "3", > #> "4" > #> ], > #> [ > #> "5", > #> "6" > #> ] > #> ] > chunked_array(1:2, 3:4, 5:6)$cast(float64()) > #> ChunkedArray > #> [ > #> [ > #> 1, > #> 2, > #> 3, > #> 4, > #> 5, > #> 6 > #> ] > #> ] > chunked_array(1:2, 3:4, 5:6)$cast(int64()) > #> ChunkedArray > #> [ > #> [ > #> 1, > #> 2, > #> 3, > #> 4, > #> 5, > #> 6 > #> ] > #> ] > chunked_array(1:2, 3:4, 5:6)$cast(date32()) > #> ChunkedArray > #> [ > #> [ > #> 1970-01-02, > #> 1970-01-03 > #> ], > #> [ > #> 1970-01-04, > #> 1970-01-05 > #> ], > #> [ > #> 1970-01-06, > #> 1970-01-07 > #> ] > #> ] > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15814) [R] Improve documentation for cast()
[ https://issues.apache.org/jira/browse/ARROW-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500224#comment-17500224 ] SHIMA Tatsuya commented on ARROW-15814: --- [~dragosmg] Thank you for giving me the opportunity. I have created a minimum PR for now. https://github.com/apache/arrow/pull/12546 Perhaps it is important too to touch on this in the article of dplyr as suggested in ARROW-14703. > [R] Improve documentation for cast() > > > Key: ARROW-15814 > URL: https://issues.apache.org/jira/browse/ARROW-15814 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: good-first-issue, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Originated in the > [comments|https://issues.apache.org/jira/browse/ARROW-14820?focusedCommentId=17498465&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17498465] > for ARROW-14820. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15828) [R] ChunkedArray$cast() combine multiple arrays into one
SHIMA Tatsuya created ARROW-15828: - Summary: [R] ChunkedArray$cast() combine multiple arrays into one Key: ARROW-15828 URL: https://issues.apache.org/jira/browse/ARROW-15828 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 7.0.0 Reporter: SHIMA Tatsuya It appears that if I try to cast to int or float, the array will be one. {code:r} library(arrow, warn.conflicts = FALSE) #> See arrow_info() for available features chunked_array(1:2, 3:4, 5:6)$cast(string()) #> ChunkedArray #> [ #> [ #> "1", #> "2" #> ], #> [ #> "3", #> "4" #> ], #> [ #> "5", #> "6" #> ] #> ] chunked_array(1:2, 3:4, 5:6)$cast(float64()) #> ChunkedArray #> [ #> [ #> 1, #> 2, #> 3, #> 4, #> 5, #> 6 #> ] #> ] chunked_array(1:2, 3:4, 5:6)$cast(int64()) #> ChunkedArray #> [ #> [ #> 1, #> 2, #> 3, #> 4, #> 5, #> 6 #> ] #> ] chunked_array(1:2, 3:4, 5:6)$cast(date32()) #> ChunkedArray #> [ #> [ #> 1970-01-02, #> 1970-01-03 #> ], #> [ #> 1970-01-04, #> 1970-01-05 #> ], #> [ #> 1970-01-06, #> 1970-01-07 #> ] #> ] {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15734) [R][Docs] Enable searching R docs
[ https://issues.apache.org/jira/browse/ARROW-15734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-15734: -- Component/s: R > [R][Docs] Enable searching R docs > - > > Key: ARROW-15734 > URL: https://issues.apache.org/jira/browse/ARROW-15734 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: SHIMA Tatsuya >Priority: Major > Labels: pull-request-available > Attachments: bs5.png, fixed-bs5.png, > image-2022-03-01-00-33-12-050.png, image-2022-03-01-00-46-51-350.png, > updated-list.png > > Time Spent: 40m > Remaining Estimate: 0h > > Enable Bootstrap 5 in pkgdown website to use the built-in search feature. > Do you have any plans to switch to Bootstrap 5? > https://pkgdown.r-lib.org/articles/search.html > https://pkgdown.r-lib.org/articles/customise.html -- This message was sent by Atlassian Jira (v8.20.1#820001)