[jira] [Commented] (ARROW-18123) [Python] Cannot use multi-byte characters in file names

2022-10-24 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623177#comment-17623177
 ] 

SHIMA Tatsuya commented on ARROW-18123:
---

Thanks for your comment.
But does that explanation cover why relative paths can be used as long as they 
do not contain multi-byte characters?
The sample code appears to use relative paths.

Also, the documentation I am looking at does not seem to link to that detailed 
explanation.
[https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html]
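As the traceback below shows, the path is handed to {{FileSystem.from_uri}}, and a raw multi-byte character is not valid inside a URI unless it is percent-encoded. A small standard-library sketch (illustration only, not pyarrow's internal code) of why '例.parquet' trips a URI parser while its percent-encoded form does not:

```python
from urllib.parse import quote, urlparse

path = '例.parquet'

# A raw multi-byte character is not legal in a URI; it has to be
# percent-encoded first (the UTF-8 bytes of '例' are E4 BE 8B).
encoded = quote(path)
print(encoded)  # %E4%BE%8B.parquet

# The encoded form is accepted by a URI parser without complaint.
print(urlparse('file:///tmp/' + encoded).path)  # /tmp/%E4%BE%8B.parquet
```

This only illustrates the URI/plain-path ambiguity; the actual fix presumably belongs in how pyarrow decides whether a string is a URI or a local filesystem path.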

> [Python] Cannot use multi-byte characters in file names
> ---
>
> Key: ARROW-18123
> URL: https://issues.apache.org/jira/browse/ARROW-18123
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Error when specifying a file path containing multi-byte characters in 
> {{pyarrow.parquet.write_table}}.
> For example, use {{例.parquet}} as the file path.
> {code:python}
> Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
> ... 'two': ['foo', 'bar', 'baz'],
> ... 'three': [True, False, True]},
> ... index=list('abc'))
> >>> table = pa.Table.from_pandas(df)
> >>> import pyarrow.parquet as pq
> >>> pq.write_table(table, '例.parquet')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 2920, in write_table
> with ParquetWriter(
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 911, in __init__
> filesystem, path = _resolve_filesystem_and_path(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line
> 184, in _resolve_filesystem_and_path
> filesystem, path = FileSystem.from_uri(path)
>   File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18123) [Python] Cannot use multi-byte characters in file names

2022-10-21 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-18123:
-

 Summary: [Python] Cannot use multi-byte characters in file names
 Key: ARROW-18123
 URL: https://issues.apache.org/jira/browse/ARROW-18123
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya


An error occurs when a file path containing multi-byte characters is specified 
in {{pyarrow.parquet.write_table}}.

For example, use {{例.parquet}} as the file path.

{code:python}
Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> import pyarrow as pa
>>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
... 'two': ['foo', 'bar', 'baz'],
... 'three': [True, False, True]},
... index=list('abc'))
>>> table = pa.Table.from_pandas(df)
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, '例.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File
"/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
line 2920, in write_table
with ParquetWriter(
  File
"/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
line 911, in __init__
filesystem, path = _resolve_filesystem_and_path(
  File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line
184, in _resolve_filesystem_and_path
filesystem, path = FileSystem.from_uri(path)
  File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 144, in
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
{code}





[jira] [Updated] (ARROW-17737) [R] Groups before conversion to a Table must not be restored after `collect()`

2022-10-07 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17737:
--
Summary: [R] Groups before conversion to a Table must not be restored after 
`collect()`  (was: [R] Continue to retain grouping metadata even if ungroup 
arrow dplyr query)

> [R] Groups before conversion to a Table must not be restored after `collect()`
> --
>
> Key: ARROW-17737
> URL: https://issues.apache.org/jira/browse/ARROW-17737
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it 
> becomes an arrow dplyr query.
> And it must also be written back again when converted to a Table.
> {code:r}
> mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> 
> as.data.frame() |> dplyr::group_vars()
> #> character(0)
> mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> 
> as.data.frame() |> dplyr::group_vars()
> #> [1] "cyl"
> {code}
> {code:r}
> mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::group_by(vs, 
> .add = TRUE) |> dplyr::ungroup() |> dplyr::collect()
> #> # A tibble: 32 × 11
> #> # Groups:   cyl [3]
> #>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
> #>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> #>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
> #>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
> #>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
> #>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
> #>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
> #>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
> #>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
> #>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
> #>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
> #> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
> #> # … with 22 more rows
> {code}





[jira] [Resolved] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option

2022-10-01 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya resolved ARROW-17429.
---
Resolution: Fixed

This seems to have been fixed by ARROW-17355.

> [R] Error messages are not helpful of read_csv_arrow with col_types option
> --
>
> Key: ARROW-17429
> URL: https://issues.apache.org/jira/browse/ARROW-17429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> In the development version, the error message displayed when a non-convertible 
> type is specified does not seem helpful.
> {code:r}
> tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
> csv_file <- tempfile()
> on.exit(unlink(csv_file))
> write.csv(tbl, csv_file, row.names = FALSE)
> arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <dttm>
> #> 1 1970-01-01 00:00:00
> arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <chr>
> #> 1 1970-01-01T12:00:00+12:00
> arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> {code}
> In arrow 9.0.0
> {code:r}
> tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
> csv_file <- tempfile()
> on.exit(unlink(csv_file))
> write.csv(tbl, csv_file, row.names = FALSE)
> arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <dttm>
> #> 1 1970-01-01 00:00:00
> arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <chr>
> #> 1 1970-01-01T12:00:00+12:00
> arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
> #> Error:
> #> ! Invalid: In CSV column #0: CSV conversion error to int32: invalid value 
> '1970-01-01T12:00:00+12:00'
> arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
> #> Error:
> #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[ns]: 
> expected no zone offset in '1970-01-01T12:00:00+12:00'
> {code}
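For reference (a standard-library illustration, not Arrow's CSV converter): the failing value carries a +12:00 zone offset, which Python's ISO-8601 parser accepts, and which also explains why type inference ("?" above) produced 1970-01-01 00:00:00 in UTC:

```python
from datetime import datetime, timezone

ts = '1970-01-01T12:00:00+12:00'

# Python parses the +12:00 offset that Arrow's timestamp[ns]
# CSV converter rejects with "expected no zone offset".
dt = datetime.fromisoformat(ts)
print(dt.utcoffset())  # 12:00:00

# Normalized to UTC this is midnight, matching the inferred value above.
print(dt.astimezone(timezone.utc).isoformat())  # 1970-01-01T00:00:00+00:00
```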





[jira] [Assigned] (ARROW-17737) [R] Continue to retain grouping metadata even if ungroup arrow dplyr query

2022-09-20 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya reassigned ARROW-17737:
-

Assignee: SHIMA Tatsuya

> [R] Continue to retain grouping metadata even if ungroup arrow dplyr query
> --
>
> Key: ARROW-17737
> URL: https://issues.apache.org/jira/browse/ARROW-17737
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>
> Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it 
> becomes an arrow dplyr query.
> And it must also be written back again when converted to a Table.
> {code:r}
> mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> 
> as.data.frame() |> dplyr::group_vars()
> #> character(0)
> mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> 
> as.data.frame() |> dplyr::group_vars()
> #> [1] "cyl"
> {code}
> {code:r}
> mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::group_by(vs, 
> .add = TRUE) |> dplyr::ungroup() |> dplyr::collect()
> #> # A tibble: 32 × 11
> #> # Groups:   cyl [3]
> #>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
> #>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> #>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
> #>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
> #>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
> #>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
> #>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
> #>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
> #>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
> #>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
> #>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
> #> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
> #> # … with 22 more rows
> {code}





[jira] [Updated] (ARROW-17737) [R] Continue to retain grouping metadata even if ungroup arrow dplyr query

2022-09-17 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17737:
--
Description: 
Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when the 
Table becomes an arrow dplyr query, and written back again when the query is 
converted to a Table.

{code:r}
mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> 
as.data.frame() |> dplyr::group_vars()
#> character(0)
mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> 
as.data.frame() |> dplyr::group_vars()
#> [1] "cyl"
{code}

{code:r}
mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::group_by(vs, 
.add = TRUE) |> dplyr::ungroup() |> dplyr::collect()
#> # A tibble: 32 × 11
#> # Groups:   cyl [3]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows
{code}

  was:
Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it 
becomes an arrow dplyr query.
And it must also be written back again when converted to a Table.

{code:r}
mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> 
as.data.frame() |> dplyr::group_vars()
#> character(0)
mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> 
as.data.frame() |> dplyr::group_vars()
#> [1] "cyl"
{code}


> [R] Continue to retain grouping metadata even if ungroup arrow dplyr query
> --
>
> Key: ARROW-17737
> URL: https://issues.apache.org/jira/browse/ARROW-17737
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it 
> becomes an arrow dplyr query.
> And it must also be written back again when converted to a Table.
> {code:r}
> mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> 
> as.data.frame() |> dplyr::group_vars()
> #> character(0)
> mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> 
> as.data.frame() |> dplyr::group_vars()
> #> [1] "cyl"
> {code}
> {code:r}
> mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::group_by(vs, 
> .add = TRUE) |> dplyr::ungroup() |> dplyr::collect()
> #> # A tibble: 32 × 11
> #> # Groups:   cyl [3]
> #>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
> #>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> #>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
> #>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
> #>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
> #>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
> #>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
> #>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
> #>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
> #>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
> #>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
> #> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
> #> # … with 22 more rows
> {code}





[jira] [Commented] (ARROW-17738) [R] dplyr::compute should convert from grouped arrow_dplyr_query to arrow Table

2022-09-17 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606158#comment-17606158
 ] 

SHIMA Tatsuya commented on ARROW-17738:
---

I think it is confusing to users that compute does not produce a Table as 
intended when grouping remains after summarise etc. has been executed.

{code:r}
mtcars |> arrow::arrow_table() |> dplyr::group_by(vs, am) |> 
dplyr::summarise(wt = mean(wt)) |> dplyr::compute()
#> Table (query)
#> vs: double
#> am: double
#> wt: double
#>
#> * Grouped by vs
#> See $.data for the source Arrow object
{code}

> [R] dplyr::compute should convert from grouped arrow_dplyr_query to arrow 
> Table
> ---
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the 
> arrow dplyr query and convert it to a Table, but it does not seem to work 
> correctly for grouped arrow dplyr queries and does not result in a Table.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> 
> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> {{as_arrow_table()}} works fine.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
>  





[jira] [Updated] (ARROW-17738) [R] dplyr::compute should convert from grouped arrow_dplyr_query to arrow Table

2022-09-17 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17738:
--
Summary: [R] dplyr::compute should convert from grouped arrow_dplyr_query 
to arrow Table  (was: [R] dplyr::compute converts from grouped 
arrow_dplyr_query to arrow Table)

> [R] dplyr::compute should convert from grouped arrow_dplyr_query to arrow 
> Table
> ---
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the 
> arrow dplyr query and convert it to a Table, but it does not seem to work 
> correctly for grouped arrow dplyr queries and does not result in a Table.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> 
> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> {{as_arrow_table()}} works fine.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
>  





[jira] [Updated] (ARROW-17738) [R] dplyr::compute converts from grouped arrow_dplyr_query to arrow Table

2022-09-17 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17738:
--
Summary: [R] dplyr::compute converts from grouped arrow_dplyr_query to 
arrow Table  (was: [R] dplyr::compute does not work for grouped arrow dplyr 
query)

> [R] dplyr::compute converts from grouped arrow_dplyr_query to arrow Table
> -
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the 
> arrow dplyr query and convert it to a Table, but it does not seem to work 
> correctly for grouped arrow dplyr queries and does not result in a Table.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> 
> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> {{as_arrow_table()}} works fine.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
>  





[jira] [Commented] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query

2022-09-17 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606134#comment-17606134
 ] 

SHIMA Tatsuya commented on ARROW-17738:
---

Ah, is this the intended behavior?
I don't understand why this behavior would be intended; I think compute should 
return a Table here, just as dbplyr and dtplyr do.

> [R] dplyr::compute does not work for grouped arrow dplyr query
> --
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the 
> arrow dplyr query and convert it to a Table, but it does not seem to work 
> correctly for grouped arrow dplyr queries and does not result in a Table.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> 
> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> {{as_arrow_table()}} works fine.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
>  





[jira] [Assigned] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query

2022-09-17 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya reassigned ARROW-17738:
-

Assignee: SHIMA Tatsuya

> [R] dplyr::compute does not work for grouped arrow dplyr query
> --
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the 
> arrow dplyr query and convert it to a Table, but it does not seem to work 
> correctly for grouped arrow dplyr queries and does not result in a Table.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> 
> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> {{as_arrow_table()}} works fine.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
>  





[jira] [Assigned] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query

2022-09-17 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya reassigned ARROW-17738:
-

Assignee: (was: SHIMA Tatsuya)

> [R] dplyr::compute does not work for grouped arrow dplyr query
> --
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the 
> arrow dplyr query and convert it to a Table, but it does not seem to work 
> correctly for grouped arrow dplyr queries and does not result in a Table.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> 
> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> {{as_arrow_table()}} works fine.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
>  





[jira] [Assigned] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query

2022-09-17 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya reassigned ARROW-17738:
-

Assignee: SHIMA Tatsuya

> [R] dplyr::compute does not work for grouped arrow dplyr query
> --
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the 
> arrow dplyr query and convert it to a Table, but it does not seem to work 
> correctly for grouped arrow dplyr queries and does not result in a Table.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> 
> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> {{as_arrow_table()}} works fine.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
>  





[jira] [Commented] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query

2022-09-17 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606116#comment-17606116
 ] 

SHIMA Tatsuya commented on ARROW-17738:
---

I have updated the description.
Grouped arrow dplyr queries are not converted to tables by {{dplyr::compute}}.

> [R] dplyr::compute does not work for grouped arrow dplyr query
> --
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the 
> arrow dplyr query and convert it to a Table, but it does not seem to work 
> correctly for grouped arrow dplyr queries and does not result in a Table.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> 
> dplyr::compute() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> {{as_arrow_table()}} works fine.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> arrow::as_arrow_table() |> class()
> #> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
> {code}
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
>  





[jira] [Updated] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query

2022-09-17 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17738:
--
Description: 
It is expected that {{dplyr::compute()}} will perform the calculation on the 
arrow dplyr query and convert it to a Table, but it does not seem to work 
correctly for grouped arrow dplyr queries and does not result in a Table.

{code:r}
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> 
dplyr::compute() |> class()
#> [1] "Table""ArrowTabular" "ArrowObject"  "R6"
{code}

{{as_arrow_table()}} works fine.

{code:r}
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) 
|> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
arrow::as_arrow_table() |> class()
#> [1] "Table""ArrowTabular" "ArrowObject"  "R6"
{code}

It seems to revert to arrow dplyr query in the following line.
[https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]

 

  was:
It is expected that {{dplyr::compute()}} will perform the calculation on the 
arrow dplyr query and convert it to a Table, but it does not seem to work 
correctly for grouped arrow dplyr queries and does not result in a Table.
{code:r}
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> 
dplyr::compute() |> class()
#> [1] "Table""ArrowTabular" "ArrowObject"  "R6"
{code}
{{as_arrow_table()}} works fine.
{code:r}
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) 
|> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
arrow::as_arrow_table() |> class()
#> [1] "Table""ArrowTabular" "ArrowObject"  "R6"
{code}
It seems to revert to arrow dplyr query in the following line.
[https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
 
 
dplyr::compute() is expected to perform the computation on the arrow dplyr query and convert it to a Table, but for a grouped arrow dplyr query it does not seem to work correctly, and the result is not a Table.

 


> [R] dplyr::compute does not work for grouped arrow dplyr query
> --
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the 
> arrow dplyr query and convert it to a Table, but it does not seem to work 
> correctly for grouped arrow dplyr queries and does not result in a Table.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> 
> dplyr::compute() |> class()
> #> [1] "Table""ArrowTabular" "ArrowObject"  "R6"
> {code}
> {{as_arrow_table()}} works fine.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> arrow::as_arrow_table() |> class()
> #> [1] "Table""ArrowTabular" "ArrowObject"  "R6"
> {code}
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
>  





[jira] [Updated] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query

2022-09-17 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17738:
--
Description: 
It is expected that {{dplyr::compute()}} will perform the calculation on the 
arrow dplyr query and convert it to a Table, but it does not seem to work 
correctly for grouped arrow dplyr queries and does not result in a Table.
{code:r}
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> 
dplyr::compute() |> class()
#> [1] "Table""ArrowTabular" "ArrowObject"  "R6"
{code}
{{as_arrow_table()}} works fine.
{code:r}
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) 
|> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
arrow::as_arrow_table() |> class()
#> [1] "Table""ArrowTabular" "ArrowObject"  "R6"
{code}
It seems to revert to arrow dplyr query in the following line.
[https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
 
 
dplyr::compute() is expected to perform the computation on the arrow dplyr query and convert it to a Table, but for a grouped arrow dplyr query it does not seem to work correctly, and the result is not a Table.

 

  was:
{code:r}
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) 
|> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
arrow::as_arrow_table() |> class()
#> [1] "Table""ArrowTabular" "ArrowObject"  "R6"
{code}

It seems to revert to arrow dplyr query in the following line.
https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75


> [R] dplyr::compute does not work for grouped arrow dplyr query
> --
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> It is expected that {{dplyr::compute()}} will perform the calculation on the 
> arrow dplyr query and convert it to a Table, but it does not seem to work 
> correctly for grouped arrow dplyr queries and does not result in a Table.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::ungroup() |> 
> dplyr::compute() |> class()
> #> [1] "Table""ArrowTabular" "ArrowObject"  "R6"
> {code}
> {{as_arrow_table()}} works fine.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> arrow::as_arrow_table() |> class()
> #> [1] "Table""ArrowTabular" "ArrowObject"  "R6"
> {code}
> It seems to revert to arrow dplyr query in the following line.
> [https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75]
>  
>  
> dplyr::compute() is expected to perform the computation on the arrow dplyr query and convert it to a Table, but for a grouped arrow dplyr query it does not seem to work correctly, and the result is not a Table.
>  





[jira] [Updated] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query

2022-09-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17738:
--
Issue Type: Bug  (was: Improvement)

> [R] dplyr::compute does not work for grouped arrow dplyr query
> --
>
> Key: ARROW-17738
> URL: https://issues.apache.org/jira/browse/ARROW-17738
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> dplyr::collect(FALSE) |> class()
> #> [1] "arrow_dplyr_query"
> mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
> arrow::as_arrow_table() |> class()
> #> [1] "Table""ArrowTabular" "ArrowObject"  "R6"
> {code}
> It seems to revert to arrow dplyr query in the following line.
> https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75





[jira] [Updated] (ARROW-17737) [R] Continue to retain grouping metadata even if ungroup arrow dplyr query

2022-09-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17737:
--
Issue Type: Bug  (was: Improvement)

> [R] Continue to retain grouping metadata even if ungroup arrow dplyr query
> --
>
> Key: ARROW-17737
> URL: https://issues.apache.org/jira/browse/ARROW-17737
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it 
> becomes an arrow dplyr query.
> And it must also be written back again when converted to a Table.
> {code:r}
> mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> 
> as.data.frame() |> dplyr::group_vars()
> #> character(0)
> mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> 
> as.data.frame() |> dplyr::group_vars()
> #> [1] "cyl"
> {code}





[jira] [Created] (ARROW-17738) [R] dplyr::compute does not work for grouped arrow dplyr query

2022-09-15 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17738:
-

 Summary: [R] dplyr::compute does not work for grouped arrow dplyr 
query
 Key: ARROW-17738
 URL: https://issues.apache.org/jira/browse/ARROW-17738
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya


{code:r}
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::compute() |> 
class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::collect(FALSE) 
|> class()
#> [1] "arrow_dplyr_query"
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> 
arrow::as_arrow_table() |> class()
#> [1] "Table""ArrowTabular" "ArrowObject"  "R6"
{code}

It seems to revert to arrow dplyr query in the following line.
https://github.com/apache/arrow/blob/7cfdfbb0d5472f8f8893398b51042a3ca1dd0adf/r/R/dplyr-collect.R#L73-L75





[jira] [Updated] (ARROW-17737) [R] Continue to retain grouping metadata even if ungroup arrow dplyr query

2022-09-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17737:
--
Description: 
Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it 
becomes an arrow dplyr query.
And it must also be written back again when converted to a Table.

{code:r}
mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> 
as.data.frame() |> dplyr::group_vars()
#> character(0)
mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> 
as.data.frame() |> dplyr::group_vars()
#> [1] "cyl"
{code}

  was:
Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it 
becomes arrow dplyr query.
And it must also be written back again when converted to Table.

{code:r}
mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> 
as.data.frame() |> dplyr::group_vars()
#> character(0)
mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> 
as.data.frame() |> dplyr::group_vars()
#> [1] "cyl"
{code}


> [R] Continue to retain grouping metadata even if ungroup arrow dplyr query
> --
>
> Key: ARROW-17737
> URL: https://issues.apache.org/jira/browse/ARROW-17737
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it 
> becomes an arrow dplyr query.
> And it must also be written back again when converted to a Table.
> {code:r}
> mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> 
> as.data.frame() |> dplyr::group_vars()
> #> character(0)
> mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> 
> as.data.frame() |> dplyr::group_vars()
> #> [1] "cyl"
> {code}





[jira] [Updated] (ARROW-17737) [R] Continue to retain grouping metadata even if ungroup arrow dplyr query

2022-09-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17737:
--
Description: 
Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it 
becomes arrow dplyr query.
And it must also be written back again when converted to Table.

{code:r}
mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> 
as.data.frame() |> dplyr::group_vars()
#> character(0)
mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> 
as.data.frame() |> dplyr::group_vars()
#> [1] "cyl"
{code}

  was:
Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it 
becomes arrow dplyr query.

{code:r}
mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> 
as.data.frame() |> dplyr::group_vars()
#> character(0)
mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> 
as.data.frame() |> dplyr::group_vars()
#> [1] "cyl"
{code}


> [R] Continue to retain grouping metadata even if ungroup arrow dplyr query
> --
>
> Key: ARROW-17737
> URL: https://issues.apache.org/jira/browse/ARROW-17737
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it 
> becomes arrow dplyr query.
> And it must also be written back again when converted to Table.
> {code:r}
> mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> 
> as.data.frame() |> dplyr::group_vars()
> #> character(0)
> mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> 
> as.data.frame() |> dplyr::group_vars()
> #> [1] "cyl"
> {code}





[jira] [Created] (ARROW-17737) [R] Continue to retain grouping metadata even if ungroup arrow dplyr query

2022-09-15 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17737:
-

 Summary: [R] Continue to retain grouping metadata even if ungroup 
arrow dplyr query
 Key: ARROW-17737
 URL: https://issues.apache.org/jira/browse/ARROW-17737
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya


Perhaps {{metadata$r$attributes$.group_vars}} needs to be removed when it 
becomes arrow dplyr query.

{code:r}
mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table() |> dplyr::ungroup() |> 
as.data.frame() |> dplyr::group_vars()
#> character(0)
mtcars |> dplyr::group_by(cyl) |> arrow:::as_adq() |> dplyr::ungroup() |> 
as.data.frame() |> dplyr::group_vars()
#> [1] "cyl"
{code}





[jira] [Resolved] (ARROW-17727) [R] Implement dplyr::across() inside group_by()

2022-09-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya resolved ARROW-17727.
---
Resolution: Duplicate

I'm sorry, I created this without noticing ARROW-17689.

> [R] Implement dplyr::across() inside group_by()
> ---
>
> Key: ARROW-17727
> URL: https://issues.apache.org/jira/browse/ARROW-17727
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 10.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>






[jira] [Assigned] (ARROW-17689) [R] Implement dplyr::across() inside group_by()

2022-09-14 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya reassigned ARROW-17689:
-

Assignee: SHIMA Tatsuya

> [R] Implement dplyr::across() inside group_by()
> ---
>
> Key: ARROW-17689
> URL: https://issues.apache.org/jira/browse/ARROW-17689
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: SHIMA Tatsuya
>Priority: Major
>






[jira] [Created] (ARROW-17727) [R] Implement dplyr::across() inside group_by()

2022-09-14 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17727:
-

 Summary: [R] Implement dplyr::across() inside group_by()
 Key: ARROW-17727
 URL: https://issues.apache.org/jira/browse/ARROW-17727
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 10.0.0
Reporter: SHIMA Tatsuya
Assignee: SHIMA Tatsuya








[jira] [Created] (ARROW-17724) [R] Allow package name prefix inside dplyr::across's .fns argument

2022-09-14 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17724:
-

 Summary: [R] Allow package name prefix inside dplyr::across's .fns 
argument
 Key: ARROW-17724
 URL: https://issues.apache.org/jira/browse/ARROW-17724
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 10.0.0
Reporter: SHIMA Tatsuya


This is not a major issue, but may be worth mentioning as a known limitation.

{code:r}
library(dplyr, warn.conflicts = FALSE)
mtcars |> arrow::arrow_table() |> mutate(across(starts_with("c"), 
base::as.character)) |> collect()
#> Error in base(cyl): could not find function "base"
{code}





[jira] [Assigned] (ARROW-17416) [R] Implement lubridate::with_tz and lubridate::force_tz

2022-09-11 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya reassigned ARROW-17416:
-

Assignee: SHIMA Tatsuya

> [R] Implement lubridate::with_tz and lubridate::force_tz
> 
>
> Key: ARROW-17416
> URL: https://issues.apache.org/jira/browse/ARROW-17416
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>






[jira] [Created] (ARROW-17674) [R] Implement dplyr::across() inside arrange()

2022-09-10 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17674:
-

 Summary: [R] Implement dplyr::across() inside arrange()
 Key: ARROW-17674
 URL: https://issues.apache.org/jira/browse/ARROW-17674
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya
Assignee: SHIMA Tatsuya








[jira] [Updated] (ARROW-17673) [R] `desc` in `dplyr::arrange` should allow `dplyr::` prefix

2022-09-10 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17673:
--
Summary: [R] `desc` in `dplyr::arrange` should allow `dplyr::` prefix  
(was: [R] desc in dplyr::arrange should allow dplyr:: prefix)

> [R] `desc` in `dplyr::arrange` should allow `dplyr::` prefix
> 
>
> Key: ARROW-17673
> URL: https://issues.apache.org/jira/browse/ARROW-17673
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>
> This example works.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::arrange(desc(cyl)) |> 
> dplyr::collect()
> {code}
> But the next one is not currently supported.
> {code:r}
> mtcars |> arrow::arrow_table() |> dplyr::arrange(dplyr::desc(cyl)) |> 
> dplyr::collect()
> {code}





[jira] [Created] (ARROW-17673) [R] desc in dplyr::arrange should allow dplyr:: prefix

2022-09-10 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17673:
-

 Summary: [R] desc in dplyr::arrange should allow dplyr:: prefix
 Key: ARROW-17673
 URL: https://issues.apache.org/jira/browse/ARROW-17673
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya
Assignee: SHIMA Tatsuya


This example works.

{code:r}
mtcars |> arrow::arrow_table() |> dplyr::arrange(desc(cyl)) |> dplyr::collect()
{code}

But the next one is not currently supported.

{code:r}
mtcars |> arrow::arrow_table() |> dplyr::arrange(dplyr::desc(cyl)) |> 
dplyr::collect()
{code}





[jira] [Commented] (ARROW-17432) [R] messed up rows when importing large csv into parquet

2022-08-25 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584903#comment-17584903
 ] 

SHIMA Tatsuya commented on ARROW-17432:
---

Hi, how about passing the schema to the {{col_types}} argument?

{code:r}
csv_stream <- open_dataset(csv_file, format = "csv",
                           col_types = sch)
{code}

Or, using {{readr::read_csv()}}?

I also wonder if the number of rows in the dataset fetched is the same in all 
cases.

> [R] messed up rows when importing large csv into parquet
> 
>
> Key: ARROW-17432
> URL: https://issues.apache.org/jira/browse/ARROW-17432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0, 9.0.0
> Environment: R version 4.2.1
> Running in Arch Linux - EndeavourOS
> arrow_info()
> Arrow package version: 9.0.0
> Capabilities:
>
> datasetTRUE
> substrait FALSE
> parquetTRUE
> json   TRUE
> s3 TRUE
> gcsTRUE
> utf8proc   TRUE
> re2TRUE
> snappy TRUE
> gzip   TRUE
> brotli TRUE
> zstd   TRUE
> lz4TRUE
> lz4_frame  TRUE
> lzo   FALSE
> bz2TRUE
> jemalloc   TRUE
> mimalloc   TRUE
> Memory:
>   
> Allocator jemalloc
> Current   49.31 Kb
> Max1.63 Mb
> Runtime:
> 
> SIMD Level  avx2
> Detected SIMD Level avx2
> Build:
>   
> C++ Library Version  9.0.0
> C++ Compiler   GNU
> C++ Compiler Version 7.5.0
> 
> print(pa.__version__)
> 9.0.0
>Reporter: Guillermo Duran
>Priority: Major
>
> This is a weird issue that creates new rows when importing a large csv (56 
> GB) into parquet in R. It occurred with both R Arrow 8.0.0 and 9.0.0 BUT 
> didn't occur with the Python Arrow library 9.0.0. Due to the large size of 
> the original csv it's difficult to create a reproducible example, but I share 
> the code and outputs.
> The code I use in R to import the csv:
> {code:r}
> library(arrow)
> library(dplyr)
>  
> csv_file <- "/ebird_erd2021/full/obs.csv"
> dest <- "/ebird_erd2021/full/obs_parquet/" 
> sch = arrow::schema(checklist_id = float32(),
>                     species_code = string(),
>                     exotic_category = float32(),
>                     obs_count = float32(),
>                     only_presence_reported = float32(),
>                     only_slash_reported = float32(),
>                     valid = float32(),
>                     reviewed = float32(),
>                     has_media = float32()
>                     )
> csv_stream <- open_dataset(csv_file, format = "csv", 
>                            schema = sch, skip_rows = 1)
> write_dataset(csv_stream, dest, format = "parquet", 
>               max_rows_per_file=100L,
>               hive_style = TRUE,
>               existing_data_behavior = "overwrite"){code}
> When I load the dataset and check one random _checklist_id_ I get rows that 
> are not part of the _obs.csv_ file. There shouldn't be duplicated species in 
> a checklist but there are ({_}amerob{_} for example)...  also note that the 
> duplicated species have different {_}obs_count{_}. 50 species in total in 
> that specific {_}checklist_id{_}.
> {code:r}
> parquet_arrow <- open_dataset(dest, format = "parquet")
> parquet_arrow |> 
>   filter(checklist_id == 18543372) |> 
>   arrange(species_code) |> 
>   collect() 
> # A tibble: 50 × 3
>checklist_id species_code obs_count
>
>  1 18543372 altori   3
>  2 18543372 amekes   1
>  3 18543372 amered  40
>  4 18543372 amerob  30
>  5 18543372 amerob   9
>  6 18543372 balori   9
>  7 18543372 blkter   9
>  8 18543372 blkvul  20
>  9 18543372 buggna   1
> 10 18543372 buwwar   1
> # … with 40 more rows
> # ℹ Use `print(n = ...)` to see more rows{code}
> If I use awk to query the csv file with that same checklist id, I get 
> something different:
> {code:bash}
> $ awk -F "," '{ if ($1 == 18543372) { print } }' obs.csv
> 18543372.0,rewbla,,60.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,amerob,,30.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,robgro,,2.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,eastow,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,sedwre1,,2.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,ovenbi1,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,buggna,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,reshaw,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,turvul,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,gowwar,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,balori,,9.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,buwwar,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,grycat,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,cangoo,,6.0,0.0,0.0,1.0,0.0,0.

[jira] [Commented] (ARROW-17439) [R] pull() should compute() not collect()

2022-08-23 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583872#comment-17583872
 ] 

SHIMA Tatsuya commented on ARROW-17439:
---

Since {{pull()}} can have additional arguments, why not add an argument that 
controls whether it should return an arrow structure or an R vector, like the 
{{as_data_frame}} argument that {{read_csv_arrow()}} and others have?
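
A rough sketch of what that could look like. Note that the {{as_vector}} argument name is hypothetical, not an existing API; in the meantime, a ChunkedArray can be extracted by converting the query to a Table and indexing a column:

{code:r}
library(dplyr, warn.conflicts = FALSE)

# Hypothetical API: an argument on pull() choosing the return type
# mtcars |> arrow::arrow_table() |> pull(cyl, as_vector = FALSE)

# Workaround available today: materialize a Table, then index a column
tbl <- arrow::arrow_table(mtcars)
ca <- tbl[["cyl"]]  # a ChunkedArray rather than an R vector
class(ca)
{code}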

> [R] pull() should compute() not collect()
> -
>
> Key: ARROW-17439
> URL: https://issues.apache.org/jira/browse/ARROW-17439
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Bryce Mecum
>Priority: Major
>  Labels: good-first-issue
>
> Currently {{pull()}} returns an R vector, but it's the only dplyr verb other 
> than {{collect()}} that returns an R data structure. And there's no other 
> natural way to extract a ChunkedArray from the result of an arrow query.  





[jira] [Closed] (ARROW-15734) [R][Docs] Enable searching R docs

2022-08-21 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya closed ARROW-15734.
-
Resolution: Won't Fix

> [R][Docs] Enable searching R docs
> -
>
> Key: ARROW-15734
> URL: https://issues.apache.org/jira/browse/ARROW-15734
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
> Attachments: bs5.png, fixed-bs5.png, 
> image-2022-03-01-00-33-12-050.png, image-2022-03-01-00-46-51-350.png, 
> updated-list.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Enable Bootstrap 5 on the pkgdown website to use the built-in search feature.
> Do you have any plans to switch to Bootstrap 5?
> https://pkgdown.r-lib.org/articles/search.html
> https://pkgdown.r-lib.org/articles/customise.html
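> A minimal sketch of the change, assuming the standard {{_pkgdown.yml}} layout documented by pkgdown:
> {code:yaml}
> template:
>   bootstrap: 5
> {code}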





[jira] [Updated] (ARROW-17485) [R] Allow TRUE/FALSE to the compression option of `write_feather` (`write_ipc_file`)

2022-08-20 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17485:
--
Summary: [R] Allow TRUE/FALSE to the compression option of `write_feather` 
(`write_ipc_file`)  (was: [R] Allow TRUE/FALSE to the compression option of 
`write_feather`(`write_ipc_file`))

> [R] Allow TRUE/FALSE to the compression option of `write_feather` 
> (`write_ipc_file`)
> 
>
> Key: ARROW-17485
> URL: https://issues.apache.org/jira/browse/ARROW-17485
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We may want to create an uncompressed IPC file to share with JavaScript.
> https://quarto.org/docs/interactive/ojs/data-sources.html#file-attachments
> Currently, to do this, we need to set up the following, but the string 
> "uncompressed" is long and does not benefit from auto-completion by the IDE, 
> making it difficult to write code.
> {code:r}
> arrow::write_feather(mtcars, "data.arrow", compression = "uncompressed")
> {code}
> It would be useful to write the following.
> {code:r}
> arrow::write_feather(mtcars, "data.arrow", compression = FALSE)
> {code}





[jira] [Updated] (ARROW-17485) [R] Allow TRUE/FALSE to the compression option of `write_feather`(`write_ipc_file`)

2022-08-20 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17485:
--
Summary: [R] Allow TRUE/FALSE to the compression option of 
`write_feather`(`write_ipc_file`)  (was: [R] Allow TRUE/FALSE to the 
compression option of `write_feather`)

> [R] Allow TRUE/FALSE to the compression option of 
> `write_feather`(`write_ipc_file`)
> ---
>
> Key: ARROW-17485
> URL: https://issues.apache.org/jira/browse/ARROW-17485
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We may want to create an uncompressed IPC file to share with JavaScript.
> https://quarto.org/docs/interactive/ojs/data-sources.html#file-attachments
> Currently, to do this, we need to write the following, but the string 
> "uncompressed" is long and does not benefit from IDE auto-completion, making 
> the code cumbersome to write.
> {code:r}
> arrow::write_feather(mtcars, "data.arrow", compression = "uncompressed")
> {code}
> It would be useful to write the following.
> {code:r}
> arrow::write_feather(mtcars, "data.arrow", compression = FALSE)
> {code}





[jira] [Updated] (ARROW-17485) [R] Allow TRUE/FALSE to the compression option of `write_feather`

2022-08-20 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17485:
--
Summary: [R] Allow TRUE/FALSE to the compression option of `write_feather`  
(was: [R] Allow FALSE to the compression option of `write_feather`)

> [R] Allow TRUE/FALSE to the compression option of `write_feather`
> -
>
> Key: ARROW-17485
> URL: https://issues.apache.org/jira/browse/ARROW-17485
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>
> We may want to create an uncompressed IPC file to share with JavaScript.
> https://quarto.org/docs/interactive/ojs/data-sources.html#file-attachments
> Currently, to do this, we need to write the following, but the string 
> "uncompressed" is long and does not benefit from IDE auto-completion, making 
> the code cumbersome to write.
> {code:r}
> arrow::write_feather(mtcars, "data.arrow", compression = "uncompressed")
> {code}
> It would be useful to write the following.
> {code:r}
> arrow::write_feather(mtcars, "data.arrow", compression = FALSE)
> {code}





[jira] [Assigned] (ARROW-17485) [R] Allow FALSE to the compression option of `write_feather`

2022-08-20 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya reassigned ARROW-17485:
-

Assignee: SHIMA Tatsuya

> [R] Allow FALSE to the compression option of `write_feather`
> 
>
> Key: ARROW-17485
> URL: https://issues.apache.org/jira/browse/ARROW-17485
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>
> We may want to create an uncompressed IPC file to share with JavaScript.
> https://quarto.org/docs/interactive/ojs/data-sources.html#file-attachments
> Currently, to do this, we need to write the following, but the string 
> "uncompressed" is long and does not benefit from IDE auto-completion, making 
> the code cumbersome to write.
> {code:r}
> arrow::write_feather(mtcars, "data.arrow", compression = "uncompressed")
> {code}
> It would be useful to write the following.
> {code:r}
> arrow::write_feather(mtcars, "data.arrow", compression = FALSE)
> {code}





[jira] [Created] (ARROW-17485) [R] Allow FALSE to the compression option of `write_feather`

2022-08-20 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17485:
-

 Summary: [R] Allow FALSE to the compression option of 
`write_feather`
 Key: ARROW-17485
 URL: https://issues.apache.org/jira/browse/ARROW-17485
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya


We may want to create an uncompressed IPC file to share with JavaScript.
https://quarto.org/docs/interactive/ojs/data-sources.html#file-attachments

Currently, to do this, we need to write the following, but the string 
"uncompressed" is long and does not benefit from IDE auto-completion, making 
the code cumbersome to write.

{code:r}
arrow::write_feather(mtcars, "data.arrow", compression = "uncompressed")
{code}

It would be useful to write the following.

{code:r}
arrow::write_feather(mtcars, "data.arrow", compression = FALSE)
{code}
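A minimal sketch of how a logical {{compression}} argument could be normalized inside the R bindings; the helper name and the choice of "lz4" as the compressed default are illustrative assumptions, not arrow's actual implementation:

```r
# Hypothetical helper (not arrow's actual code): map TRUE/FALSE onto the
# codec strings that write_feather() already understands.
normalize_compression <- function(compression) {
  if (isFALSE(compression)) {
    "uncompressed"   # compression = FALSE -> no compression
  } else if (isTRUE(compression)) {
    "lz4"            # assumed compressed default when compression = TRUE
  } else {
    compression      # strings such as "zstd" pass through unchanged
  }
}

normalize_compression(FALSE)  # "uncompressed"
normalize_compression(TRUE)   # "lz4"
```

With such a mapping, passing compression = FALSE would behave like passing compression = "uncompressed" today.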





[jira] [Commented] (ARROW-17439) [R] pull() should compute() not collect()

2022-08-19 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581935#comment-17581935
 ] 

SHIMA Tatsuya commented on ARROW-17439:
---

Note that in dbplyr and dtplyr, pull returns a vector in R (as it should).
https://dbplyr.tidyverse.org/reference/pull.tbl_sql.html

I think this change in behavior makes sense, but it may confuse users in its 
current state without documentation references.

> [R] pull() should compute() not collect()
> -
>
> Key: ARROW-17439
> URL: https://issues.apache.org/jira/browse/ARROW-17439
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Bryce Mecum
>Priority: Major
>  Labels: good-first-issue
>
> Currently {{pull()}} returns an R vector, but it's the only dplyr verb other 
> than {{collect()}} that returns an R data structure. And there's no other 
> natural way to extract a ChunkedArray from the result of an arrow query.  





[jira] [Assigned] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option

2022-08-19 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya reassigned ARROW-17429:
-

Assignee: SHIMA Tatsuya

> [R] Error messages are not helpful of read_csv_arrow with col_types option
> --
>
> Key: ARROW-17429
> URL: https://issues.apache.org/jira/browse/ARROW-17429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>
> The error message displayed when a non-convertible type is specified does not 
> seem to help in the development version.
> {code:r}
> tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
> csv_file <- tempfile()
> on.exit(unlink(csv_file))
> write.csv(tbl, csv_file, row.names = FALSE)
> arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <dttm>
> #> 1 1970-01-01 00:00:00
> arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <chr>
> #> 1 1970-01-01T12:00:00+12:00
> arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> {code}
> In arrow 9.0.0
> {code:r}
> tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
> csv_file <- tempfile()
> on.exit(unlink(csv_file))
> write.csv(tbl, csv_file, row.names = FALSE)
> arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <dttm>
> #> 1 1970-01-01 00:00:00
> arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <chr>
> #> 1 1970-01-01T12:00:00+12:00
> arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
> #> Error:
> #> ! Invalid: In CSV column #0: CSV conversion error to int32: invalid value 
> '1970-01-01T12:00:00+12:00'
> arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
> #> Error:
> #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[ns]: 
> expected no zone offset in '1970-01-01T12:00:00+12:00'
> {code}





[jira] [Commented] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option

2022-08-19 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581850#comment-17581850
 ] 

SHIMA Tatsuya commented on ARROW-17429:
---

This issue appears to have been introduced by 
https://github.com/apache/arrow/pull/12826.

> [R] Error messages are not helpful of read_csv_arrow with col_types option
> --
>
> Key: ARROW-17429
> URL: https://issues.apache.org/jira/browse/ARROW-17429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> The error message displayed when a non-convertible type is specified does not 
> seem to help in the development version.
> {code:r}
> tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
> csv_file <- tempfile()
> on.exit(unlink(csv_file))
> write.csv(tbl, csv_file, row.names = FALSE)
> arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <dttm>
> #> 1 1970-01-01 00:00:00
> arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <chr>
> #> 1 1970-01-01T12:00:00+12:00
> arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> {code}
> In arrow 9.0.0
> {code:r}
> tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
> csv_file <- tempfile()
> on.exit(unlink(csv_file))
> write.csv(tbl, csv_file, row.names = FALSE)
> arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <dttm>
> #> 1 1970-01-01 00:00:00
> arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <chr>
> #> 1 1970-01-01T12:00:00+12:00
> arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
> #> Error:
> #> ! Invalid: In CSV column #0: CSV conversion error to int32: invalid value 
> '1970-01-01T12:00:00+12:00'
> arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
> #> Error:
> #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[ns]: 
> expected no zone offset in '1970-01-01T12:00:00+12:00'
> {code}





[jira] [Updated] (ARROW-17425) [R] `lubridate::as_datetime()` in dplyr query should be able to handle time in sub seconds

2022-08-16 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17425:
--
Description: 
Since the current unit is fixed to "s", an error will occur if a time 
containing sub-seconds is given.
{code:r}
"1970-01-01T00:00:59.123456789" |>
  arrow::arrow_table(x = _) |>
  dplyr::mutate(x = lubridate::as_datetime(x, tz = "UTC")) |>
  dplyr::collect()
#> Error in `dplyr::collect()`:
#> ! Invalid: Failed to parse string: '1970-01-01T00:00:59.123456789' as a 
scalar of type timestamp[s]
{code}
I thought that nanoseconds should be used, but it should be noted that POSIXct 
is currently supposed to be converted to microseconds, as shown in ARROW-17424.
 
 
 

 

  was:
Since the current unit is fixed to "s", an error will occur if a time 
containing sub-seconds is given.

{code:r}
"1970-01-01T00:00:59.123456789" |>
  data.frame(x = _) |>
  arrow::arrow_table() |>
  dplyr::mutate(x = lubridate::as_datetime(x, tz = "UTC")) |>
  dplyr::collect()
#> Error in `dplyr::collect()`:
#> ! Invalid: Failed to parse string: '1970-01-01T00:00:59.123456789' as a 
scalar of type timestamp[s]
{code}

I thought that nanoseconds should be used, but it should be noted that POSIXct 
is currently supposed to be converted to microseconds, as shown in ARROW-17424.


> [R] `lubridate::as_datetime()` in dplyr query should be able to handle time 
> in sub seconds
> --
>
> Key: ARROW-17425
> URL: https://issues.apache.org/jira/browse/ARROW-17425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Since the current unit is fixed to "s", an error will occur if a time 
> containing sub-seconds is given.
> {code:r}
> "1970-01-01T00:00:59.123456789" |>
>   arrow::arrow_table(x = _) |>
>   dplyr::mutate(x = lubridate::as_datetime(x, tz = "UTC")) |>
>   dplyr::collect()
> #> Error in `dplyr::collect()`:
> #> ! Invalid: Failed to parse string: '1970-01-01T00:00:59.123456789' as a 
> scalar of type timestamp[s]
> {code}
> I thought that nanoseconds should be used, but it should be noted that 
> POSIXct is currently supposed to be converted to microseconds, as shown in 
> ARROW-17424.





[jira] [Commented] (ARROW-17414) [R]: Lack of `assume_timezone` binding

2022-08-16 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580336#comment-17580336
 ] 

SHIMA Tatsuya commented on ARROW-17414:
---

Yes.
Perhaps {{call_function}} should be used when dplyr is not used, and it would be 
great if the error message could indicate that as well.
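For instance, an untested sketch of the {{call_function}} route in R (the option key {{timezone}} and the string-to-timestamp cast are assumptions about the compute API, not confirmed by this thread):

```r
library(arrow)

# Parse the timestamp without a zone first, then attach one by invoking
# the C++ compute kernel directly -- there is no high-level R binding.
a <- Array$create("2004-04-01 12:00")$cast(timestamp("s"))
call_function("assume_timezone", a, options = list(timezone = "UTC"))
```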

> [R]: Lack of `assume_timezone` binding
> --
>
> Key: ARROW-17414
> URL: https://issues.apache.org/jira/browse/ARROW-17414
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> If we run the following code in R, we will get a C++ derived error message 
> telling us to use {{assume_timezone}}.
> However, this error message is not helpful because there is no binding for 
> the {{assume_timezone}} function in R.
> {code:r}
> tf <- tempfile()
> writeLines("2004-04-01 12:00", tf)
> arrow::read_csv_arrow(tf, schema = arrow::schema(col1 = arrow::timestamp("s", 
> "UTC")))
> #> Error:
> #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[s, tz=UTC]: 
> expected a zone offset in '2004-04-01 12:00'. If these timestamps are in 
> local time, parse them as timestamps without timezone, then call 
> assume_timezone.
> #> ℹ If you have supplied a schema and your data contains a header row, you 
> should supply the argument `skip = 1` to prevent the header being read in as 
> data.
> {code}
> It would be useful to improve the error message or to allow 
> {{assume_timezone}} to be used from R as well.
> (although {{lubridate::with_tz()}} and {{lubridate::force_tz()}} could be 
> more useful within a dplyr query)





[jira] [Updated] (ARROW-17424) [R] Microsecond is not sufficient unit for POSIXct

2022-08-16 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17424:
--
Description: 
I believe the {{POSIXct}} type of R currently corresponds to the Arrow 
{{timestamp[us, tz=UTC]}} type.

{code:r}
lubridate::as_datetime(0) |> arrow::infer_type()
#> Timestamp
#> timestamp[us, tz=UTC]
{code}

{code:r}
lubridate::as_datetime("1970-01-01 00:00:00.001") |>
  arrow::arrow_table(x = _)
#> Table
#> 1 rows x 1 columns
#> $x <timestamp[us, tz=UTC]>
{code}

{code:r}
df_a <- lubridate::as_datetime("1970-01-01 00:00:00.001") |>
  arrow::arrow_table(x = _) |>
  as.data.frame()

df_b <- lubridate::as_datetime("1970-01-01 00:00:00.001") |>
  tibble::tibble(x = _)

waldo::compare(df_a, df_b)
#> `old$x`: "1970-01-01"
#> `new$x`: "1970-01-01 00:00:00"
{code}

However, as shown below, POSIXct may hold data finer than a microsecond.

{code:r}
lubridate::as_datetime(0.1) |> as.numeric()
#> [1] 1e-09
lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric()
#> [1] 1.192093e-07
{code}

I don't know why it is currently set in microseconds, but is there any reason 
not to set it in nanoseconds?

  was:
I believe the {{POSIXct}} type of R currently corresponds to the Arrow 
{{timestamp[us, tz=UTC]}} type.

{code:r}
lubridate::as_datetime(0) |> arrow::infer_type()
#> Timestamp
#> timestamp[us, tz=UTC]
{code}

{code:r}
df_a <- lubridate::as_datetime("1970-01-01 00:00:00.001") |>
  arrow::arrow_table(x = _) |>
  as.data.frame()

df_b <- lubridate::as_datetime("1970-01-01 00:00:00.001") |>
  tibble::tibble(x = _)

waldo::compare(df_a, df_b)
#> `old$x`: "1970-01-01"
#> `new$x`: "1970-01-01 00:00:00"
{code}

However, as shown below, POSIXct may hold data finer than a microsecond.

{code:r}
lubridate::as_datetime(0.1) |> as.numeric()
#> [1] 1e-09
lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric()
#> [1] 1.192093e-07
{code}

I don't know why it is currently set in microseconds, but is there any reason 
not to set it in nanoseconds?


> [R] Microsecond is not sufficient unit for POSIXct
> --
>
> Key: ARROW-17424
> URL: https://issues.apache.org/jira/browse/ARROW-17424
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> I believe the {{POSIXct}} type of R currently corresponds to the Arrow 
> {{timestamp[us, tz=UTC]}} type.
> {code:r}
> lubridate::as_datetime(0) |> arrow::infer_type()
> #> Timestamp
> #> timestamp[us, tz=UTC]
> {code}
> {code:r}
> lubridate::as_datetime("1970-01-01 00:00:00.001") |>
>   arrow::arrow_table(x = _)
> #> Table
> #> 1 rows x 1 columns
> #> $x <timestamp[us, tz=UTC]>
> {code}
> {code:r}
> df_a <- lubridate::as_datetime("1970-01-01 00:00:00.001") |>
>   arrow::arrow_table(x = _) |>
>   as.data.frame()
> df_b <- lubridate::as_datetime("1970-01-01 00:00:00.001") |>
>   tibble::tibble(x = _)
> waldo::compare(df_a, df_b)
> #> `old$x`: "1970-01-01"
> #> `new$x`: "1970-01-01 00:00:00"
> {code}
> However, as shown below, POSIXct may hold data finer than a microsecond.
> {code:r}
> lubridate::as_datetime(0.1) |> as.numeric()
> #> [1] 1e-09
> lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric()
> #> [1] 1.192093e-07
> {code}
> I don't know why it is currently set in microseconds, but is there any reason 
> not to set it in nanoseconds?
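An editorial sketch (not part of the original report) of the round trip that drops sub-microsecond precision, assuming the arrow package is installed; the truncation behavior shown in the comments is the expected consequence of a microsecond unit, not verified output:

```r
library(arrow)

# One nanosecond past the epoch as POSIXct.
x <- as.POSIXct(1e-9, origin = "1970-01-01", tz = "UTC")

tab <- arrow_table(x = x)
tab$x$type   # timestamp[us, tz=UTC]: the unit is fixed at microseconds

# Converting back to R: the nanosecond cannot survive microsecond
# resolution and is expected to be truncated away.
as.numeric(as.data.frame(tab)$x)
```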





[jira] [Updated] (ARROW-17424) [R] Microsecond is not sufficient unit for POSIXct

2022-08-16 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17424:
--
Description: 
I believe the {{POSIXct}} type of R currently corresponds to the Arrow 
{{timestamp[us, tz=UTC]}} type.

{code:r}
lubridate::as_datetime(0) |> arrow::infer_type()
#> Timestamp
#> timestamp[us, tz=UTC]
{code}

{code:r}
df_a <- lubridate::as_datetime("1970-01-01 00:00:00.001") |>
  arrow::arrow_table(x = _) |>
  as.data.frame()

df_b <- lubridate::as_datetime("1970-01-01 00:00:00.001") |>
  tibble::tibble(x = _)

waldo::compare(df_a, df_b)
#> `old$x`: "1970-01-01"
#> `new$x`: "1970-01-01 00:00:00"
{code}

However, as shown below, POSIXct may hold data finer than a microsecond.

{code:r}
lubridate::as_datetime(0.1) |> as.numeric()
#> [1] 1e-09
lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric()
#> [1] 1.192093e-07
{code}

I don't know why it is currently set in microseconds, but is there any reason 
not to set it in nanoseconds?

  was:
I believe the {{POSIXct}} type of R currently corresponds to the Arrow 
{{timestamp[us, tz=UTC]}} type.

{code:r}
lubridate::as_datetime(0) |> arrow::infer_type()
#> Timestamp
#> timestamp[us, tz=UTC]
{code}

However, as shown below, POSIXct may hold data finer than a microsecond.

{code:r}
lubridate::as_datetime(0.1) |> as.numeric()
#> [1] 1e-09
lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric()
#> [1] 1.192093e-07
{code}

I don't know why it is currently set in microseconds, but is there any reason 
not to set it in nanoseconds?


> [R] Microsecond is not sufficient unit for POSIXct
> --
>
> Key: ARROW-17424
> URL: https://issues.apache.org/jira/browse/ARROW-17424
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> I believe the {{POSIXct}} type of R currently corresponds to the Arrow 
> {{timestamp[us, tz=UTC]}} type.
> {code:r}
> lubridate::as_datetime(0) |> arrow::infer_type()
> #> Timestamp
> #> timestamp[us, tz=UTC]
> {code}
> {code:r}
> df_a <- lubridate::as_datetime("1970-01-01 00:00:00.001") |>
>   arrow::arrow_table(x = _) |>
>   as.data.frame()
> df_b <- lubridate::as_datetime("1970-01-01 00:00:00.001") |>
>   tibble::tibble(x = _)
> waldo::compare(df_a, df_b)
> #> `old$x`: "1970-01-01"
> #> `new$x`: "1970-01-01 00:00:00"
> {code}
> However, as shown below, POSIXct may hold data finer than a microsecond.
> {code:r}
> lubridate::as_datetime(0.1) |> as.numeric()
> #> [1] 1e-09
> lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric()
> #> [1] 1.192093e-07
> {code}
> I don't know why it is currently set in microseconds, but is there any reason 
> not to set it in nanoseconds?





[jira] [Updated] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option

2022-08-16 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17429:
--
Description: 
The error message displayed when a non-convertible type is specified does not 
seem to help in the development version.

{code:r}
tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
csv_file <- tempfile()
on.exit(unlink(csv_file))
write.csv(tbl, csv_file, row.names = FALSE)

arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <dttm>
#> 1 1970-01-01 00:00:00
arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <chr>
#> 1 1970-01-01T12:00:00+12:00
arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
#> Error in as.data.frame(tab): object 'tab' not found
arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
#> Error in as.data.frame(tab): object 'tab' not found
{code}



  was:
The error message displayed when a non-convertible type is specified does not 
seem to help in the HEAD.

{code:r}
tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
csv_file <- tempfile()
on.exit(unlink(csv_file))
write.csv(tbl, csv_file, row.names = FALSE)

arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <dttm>
#> 1 1970-01-01 00:00:00
arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <chr>
#> 1 1970-01-01T12:00:00+12:00
arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
#> Error in as.data.frame(tab): object 'tab' not found
arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
#> Error in as.data.frame(tab): object 'tab' not found
{code}




> [R] Error messages are not helpful of read_csv_arrow with col_types option
> --
>
> Key: ARROW-17429
> URL: https://issues.apache.org/jira/browse/ARROW-17429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> The error message displayed when a non-convertible type is specified does not 
> seem to help in the development version.
> {code:r}
> tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
> csv_file <- tempfile()
> on.exit(unlink(csv_file))
> write.csv(tbl, csv_file, row.names = FALSE)
> arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <dttm>
> #> 1 1970-01-01 00:00:00
> arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <chr>
> #> 1 1970-01-01T12:00:00+12:00
> arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> {code}





[jira] [Updated] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option

2022-08-16 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17429:
--
Affects Version/s: (was: 9.0.0)

> [R] Error messages are not helpful of read_csv_arrow with col_types option
> --
>
> Key: ARROW-17429
> URL: https://issues.apache.org/jira/browse/ARROW-17429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> The error message displayed when a non-convertible type is specified does not 
> seem to help in the development version.
> {code:r}
> tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
> csv_file <- tempfile()
> on.exit(unlink(csv_file))
> write.csv(tbl, csv_file, row.names = FALSE)
> arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <dttm>
> #> 1 1970-01-01 00:00:00
> arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <chr>
> #> 1 1970-01-01T12:00:00+12:00
> arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> {code}
> In arrow 9.0.0
> {code:r}
> tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
> csv_file <- tempfile()
> on.exit(unlink(csv_file))
> write.csv(tbl, csv_file, row.names = FALSE)
> arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <dttm>
> #> 1 1970-01-01 00:00:00
> arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <chr>
> #> 1 1970-01-01T12:00:00+12:00
> arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
> #> Error:
> #> ! Invalid: In CSV column #0: CSV conversion error to int32: invalid value 
> '1970-01-01T12:00:00+12:00'
> arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
> #> Error:
> #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[ns]: 
> expected no zone offset in '1970-01-01T12:00:00+12:00'
> {code}





[jira] [Updated] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option

2022-08-16 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17429:
--
Description: 
The error message displayed when a non-convertible type is specified does not 
seem to help in the development version.

{code:r}
tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
csv_file <- tempfile()
on.exit(unlink(csv_file))
write.csv(tbl, csv_file, row.names = FALSE)

arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <dttm>
#> 1 1970-01-01 00:00:00
arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <chr>
#> 1 1970-01-01T12:00:00+12:00
arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
#> Error in as.data.frame(tab): object 'tab' not found
arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
#> Error in as.data.frame(tab): object 'tab' not found
{code}

In arrow 9.0.0

{code:r}
tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
csv_file <- tempfile()
on.exit(unlink(csv_file))
write.csv(tbl, csv_file, row.names = FALSE)

arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <dttm>
#> 1 1970-01-01 00:00:00
arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <chr>
#> 1 1970-01-01T12:00:00+12:00
arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
#> Error:
#> ! Invalid: In CSV column #0: CSV conversion error to int32: invalid value 
'1970-01-01T12:00:00+12:00'
arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
#> Error:
#> ! Invalid: In CSV column #0: CSV conversion error to timestamp[ns]: expected 
no zone offset in '1970-01-01T12:00:00+12:00'
{code}


  was:
The error message displayed when a non-convertible type is specified does not 
seem to help in the development version.

{code:r}
tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
csv_file <- tempfile()
on.exit(unlink(csv_file))
write.csv(tbl, csv_file, row.names = FALSE)

arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <dttm>
#> 1 1970-01-01 00:00:00
arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <chr>
#> 1 1970-01-01T12:00:00+12:00
arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
#> Error in as.data.frame(tab): object 'tab' not found
arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
#> Error in as.data.frame(tab): object 'tab' not found
{code}




> [R] Error messages are not helpful of read_csv_arrow with col_types option
> --
>
> Key: ARROW-17429
> URL: https://issues.apache.org/jira/browse/ARROW-17429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> The error message displayed when a non-convertible type is specified does not 
> seem to help in the development version.
> {code:r}
> tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
> csv_file <- tempfile()
> on.exit(unlink(csv_file))
> write.csv(tbl, csv_file, row.names = FALSE)
> arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <dttm>
> #> 1 1970-01-01 00:00:00
> arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <chr>
> #> 1 1970-01-01T12:00:00+12:00
> arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> {code}
> In arrow 9.0.0
> {code:r}
> tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
> csv_file <- tempfile()
> on.exit(unlink(csv_file))
> write.csv(tbl, csv_file, row.names = FALSE)
> arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <dttm>
> #> 1 1970-01-01 00:00:00
> arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <chr>
> #> 1 1970-01-01T12:00:00+12:00
> arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
> #> Error:
> #> ! Invalid: In CSV column #0: CSV conversion error to int32: invalid value 
> '1970-01-01T12:00:00+12:00'
> arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
> #> Error:
> #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[ns]: 
> expected no zone offset in '1970-01-01T12:00:00+12:00'
> {code}



[jira] [Updated] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option

2022-08-16 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17429:
--
Description: 
The error message displayed when a non-convertible type is specified does not 
seem helpful at HEAD.

{code:r}
tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
csv_file <- tempfile()
on.exit(unlink(csv_file))
write.csv(tbl, csv_file, row.names = FALSE)

arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <dttm>
#> 1 1970-01-01 00:00:00
arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <chr>
#> 1 1970-01-01T12:00:00+12:00
arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
#> Error in as.data.frame(tab): object 'tab' not found
arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
#> Error in as.data.frame(tab): object 'tab' not found
{code}



  was:
The error message displayed when a non-convertible type is specified does not 
seem helpful.

{code:r}
tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
csv_file <- tempfile()
on.exit(unlink(csv_file))
write.csv(tbl, csv_file, row.names = FALSE)

arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <dttm>
#> 1 1970-01-01 00:00:00
arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <chr>
#> 1 1970-01-01T12:00:00+12:00
arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
#> Error in as.data.frame(tab): object 'tab' not found
arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
#> Error in as.data.frame(tab): object 'tab' not found
{code}


> [R] Error messages are not helpful of read_csv_arrow with col_types option
> --
>
> Key: ARROW-17429
> URL: https://issues.apache.org/jira/browse/ARROW-17429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> The error message displayed when a non-convertible type is specified does not 
> seem helpful at HEAD.
> {code:r}
> tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
> csv_file <- tempfile()
> on.exit(unlink(csv_file))
> write.csv(tbl, csv_file, row.names = FALSE)
> arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <dttm>
> #> 1 1970-01-01 00:00:00
> arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
> #> # A tibble: 1 × 1
> #>   x
> #>   <chr>
> #> 1 1970-01-01T12:00:00+12:00
> arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
> #> Error in as.data.frame(tab): object 'tab' not found
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17429) [R] Error messages are not helpful of read_csv_arrow with col_types option

2022-08-16 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17429:
-

 Summary: [R] Error messages are not helpful of read_csv_arrow with 
col_types option
 Key: ARROW-17429
 URL: https://issues.apache.org/jira/browse/ARROW-17429
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya


The error message displayed when a non-convertible type is specified does not 
seem to help.

{code:r}
tbl <- tibble::tibble(time = c("1970-01-01T12:00:00+12:00"))
csv_file <- tempfile()
on.exit(unlink(csv_file))
write.csv(tbl, csv_file, row.names = FALSE)

arrow::read_csv_arrow(csv_file, col_types = "?", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <dttm>
#> 1 1970-01-01 00:00:00
arrow::read_csv_arrow(csv_file, col_types = "c", col_names = "x", skip = 1)
#> # A tibble: 1 × 1
#>   x
#>   <chr>
#> 1 1970-01-01T12:00:00+12:00
arrow::read_csv_arrow(csv_file, col_types = "i", col_names = "x", skip = 1)
#> Error in as.data.frame(tab): object 'tab' not found
arrow::read_csv_arrow(csv_file, col_types = "T", col_names = "x", skip = 1)
#> Error in as.data.frame(tab): object 'tab' not found
{code}





[jira] [Created] (ARROW-17428) [R] Implement as.integer and as.numeric for timestamp types etc. in Arrow dplyr queries

2022-08-16 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17428:
-

 Summary: [R] Implement as.integer and as.numeric for timestamp 
types etc. in Arrow dplyr queries
 Key: ARROW-17428
 URL: https://issues.apache.org/jira/browse/ARROW-17428
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya


In R, the POSIXct type is stored as seconds, so division and rounding are 
required within arrow, depending on the unit.
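The conversion the ticket asks for can be sketched in plain Python (not the arrow codebase); `UNITS` and `to_unit` are hypothetical names for illustration only: a POSIXct-style fractional-seconds value is multiplied by the target unit's ticks-per-second and rounded to get an integer timestamp.

```python
# Sketch: converting fractional epoch seconds (like R's POSIXct) to an
# integer count in a given Arrow timestamp unit.
UNITS = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}

def to_unit(seconds: float, unit: str) -> int:
    """Convert fractional epoch seconds to an integer count in `unit`."""
    return round(seconds * UNITS[unit])

assert to_unit(59.123456, "s") == 59
assert to_unit(59.123456, "ms") == 59123
assert to_unit(59.123456, "us") == 59123456
```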





[jira] [Commented] (ARROW-17414) [R]: Lack of `assume_timezone` binding

2022-08-15 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580063#comment-17580063
 ] 

SHIMA Tatsuya commented on ARROW-17414:
---

Thank you for pointing out how to call the compute functions.
Ideally, it would be great if we could add the {{with_tz}} function and update 
the error messages.

> [R]: Lack of `assume_timezone` binding
> --
>
> Key: ARROW-17414
> URL: https://issues.apache.org/jira/browse/ARROW-17414
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> If we run the following code in R, we will get a C++ derived error message 
> telling us to use {{assume_timezone}}.
> However, this error message is not helpful because there is no binding for 
> the {{assume_timezone}} function in R.
> {code:r}
> tf <- tempfile()
> writeLines("2004-04-01 12:00", tf)
> arrow::read_csv_arrow(tf, schema = arrow::schema(col1 = arrow::timestamp("s", 
> "UTC")))
> #> Error:
> #> ! Invalid: In CSV column #0: CSV conversion error to timestamp[s, tz=UTC]: 
> expected a zone offset in '2004-04-01 12:00'. If these timestamps are in 
> local time, parse them as timestamps without timezone, then call 
> assume_timezone.
> #> ℹ If you have supplied a schema and your data contains a header row, you 
> should supply the argument `skip = 1` to prevent the header being read in as 
> data.
> {code}
> It would be useful to improve the error message or to allow 
> {{assume_timezone}} to be used from R as well.
> (although {{lubridate::with_tz()}} and {{lubridate::force_tz()}} could be 
> more useful within a dplyr query)





[jira] [Updated] (ARROW-17425) [R] `lubridate::as_datetime()` in dplyr query should be able to handle time in sub seconds

2022-08-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17425:
--
Summary: [R] `lubridate::as_datetime()` in dplyr query should be able to 
handle time in sub seconds  (was: [R] lubridate::as_datetime() etc. in dplyr 
query should be able to handle time in sub seconds)

> [R] `lubridate::as_datetime()` in dplyr query should be able to handle time 
> in sub seconds
> --
>
> Key: ARROW-17425
> URL: https://issues.apache.org/jira/browse/ARROW-17425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Since the current unit is fixed to "s", an error will occur if a time 
> containing sub-seconds is given.
> {code:r}
> "1970-01-01T00:00:59.123456789" |>
>   data.frame(x = _) |>
>   arrow::arrow_table() |>
>   dplyr::mutate(x = lubridate::as_datetime(x, tz = "UTC")) |>
>   dplyr::collect()
> #> Error in `dplyr::collect()`:
> #> ! Invalid: Failed to parse string: '1970-01-01T00:00:59.123456789' as a 
> scalar of type timestamp[s]
> {code}
> I thought that nanoseconds should be used, but it should be noted that 
> POSIXct is currently supposed to be converted to microseconds, as shown in 
> ARROW-17424.





[jira] [Assigned] (ARROW-17425) [R] lubridate::as_datetime() etc. in dplyr query should be able to handle time in sub seconds

2022-08-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya reassigned ARROW-17425:
-

Assignee: SHIMA Tatsuya

> [R] lubridate::as_datetime() etc. in dplyr query should be able to handle 
> time in sub seconds
> -
>
> Key: ARROW-17425
> URL: https://issues.apache.org/jira/browse/ARROW-17425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>
> Since the current unit is fixed to "s", an error will occur if a time 
> containing sub-seconds is given.
> {code:r}
> "1970-01-01T00:00:59.123456789" |>
>   data.frame(x = _) |>
>   arrow::arrow_table() |>
>   dplyr::mutate(x = lubridate::as_datetime(x, tz = "UTC")) |>
>   dplyr::collect()
> #> Error in `dplyr::collect()`:
> #> ! Invalid: Failed to parse string: '1970-01-01T00:00:59.123456789' as a 
> scalar of type timestamp[s]
> {code}
> I thought that nanoseconds should be used, but it should be noted that 
> POSIXct is currently supposed to be converted to microseconds, as shown in 
> ARROW-17424.





[jira] [Commented] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND

2022-08-15 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579851#comment-17579851
 ] 

SHIMA Tatsuya commented on ARROW-17374:
---

I think what you are seeing in the error message is Python 3.1, not Python 3.10.
I am not sure, but are you specifying the Python version in YAML or something similar?
In YAML, if you just write {{3.10}}, it will be interpreted as the number {{3.1}}, 
so you need to quote it as the string {{"3.10"}}.
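The collapse happens because a numeric scalar has no trailing zero; plain Python float parsing shows the same effect (an analogy for the YAML behavior, not YAML parsing itself):

```python
# As a number, 3.10 is exactly 3.1, so the trailing zero is lost before
# any tool sees the "version".
assert float("3.10") == 3.1
assert str(float("3.10")) == "3.1"  # renders back as 3.1, not 3.10
# Quoting the value in YAML ("3.10") keeps it a string and preserves the text.
assert "3.10" != "3.1"
```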

If this is an irrelevant comment, please ignore it.

> [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
> --
>
> Key: ARROW-17374
> URL: https://issues.apache.org/jira/browse/ARROW-17374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0, 9.0.0, 8.0.1
> Environment: Amazon Linux 2 (RHEL) - 5.10.102-99.473.amzn2.x86_64
>Reporter: Shane Brennan
>Priority: Blocker
>
> I've been trying to install Arrow on an R notebook within AWS SageMaker. 
> SageMaker provides Jupyter-like notebooks, with each instance running Amazon 
> Linux 2 as its OS, itself based on RHEL. 
> I've tried installing it a few ways, e.g., using the standard binaries, using 
> the nightly builds, and setting ARROW_WITH_SNAPPY to ON and LIBARROW_MINIMAL, 
> but all still result in the following error. 
> {noformat}
> x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared 
> -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -Wl,-O2 -Wl,--sort-common 
> -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags 
> -Wl,--gc-sections -Wl,--allow-shlib-undefined 
> -Wl,-rpath,/home/ec2-user/anaconda3/envs/R/lib 
> -Wl,-rpath-link,/home/ec2-user/anaconda3/envs/R/lib 
> -L/home/ec2-user/anaconda3/envs/R/lib -o arrow.so RTasks.o altrep.o array.o 
> array_to_vector.o arraydata.o arrowExports.o bridge.o buffer.o chunkedarray.o 
> compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o 
> expression.o extension-impl.o feather.o field.o filesystem.o imports.o io.o 
> json.o memorypool.o message.o parquet.o r_to_arrow.o recordbatch.o 
> recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o 
> schema.o symbols.o table.o threadpool.o type_infer.o 
> -L/tmp/Rtmpuh87oc/R.INSTALL67114493a3de/arrow/libarrow/arrow-9.0.0.20220809/lib
>  -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies -lz 
> SNAPPY_LIB-NOTFOUND /home/ec2-user/anaconda3/envs/R/lib/libbz2.so -pthread 
> -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto 
> -lcurl -lssl -lcrypto -lcurl -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -lR
> x86_64-conda-linux-gnu-c++: error: SNAPPY_LIB-NOTFOUND: No such file or 
> directory
> make: *** [/home/ec2-user/anaconda3/envs/R/lib/R/share/make/shlib.mk:10: 
> arrow.so] Error 1{noformat}
> Snappy is installed on the system, and both the shared object (.so) and cmake 
> files are there. I've tried setting the system env variables Snappy_DIR 
> and Snappy_LIB to point at them, but to no avail.





[jira] [Updated] (ARROW-17425) [R] lubridate::as_datetime() etc. in dplyr query should be able to handle time in sub seconds

2022-08-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17425:
--
Summary: [R] lubridate::as_datetime() etc. in dplyr query should be able to 
handle time in sub seconds  (was: [R] [R] lubridate::as_datetime() etc. in 
dplyr query should be able to handle time in sub seconds)

> [R] lubridate::as_datetime() etc. in dplyr query should be able to handle 
> time in sub seconds
> -
>
> Key: ARROW-17425
> URL: https://issues.apache.org/jira/browse/ARROW-17425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Since the current unit is fixed to "s", an error will occur if a time 
> containing sub-seconds is given.
> {code:r}
> "1970-01-01T00:00:59.123456789" |>
>   data.frame(x = _) |>
>   arrow::arrow_table() |>
>   dplyr::mutate(x = lubridate::as_datetime(x, tz = "UTC")) |>
>   dplyr::collect()
> #> Error in `dplyr::collect()`:
> #> ! Invalid: Failed to parse string: '1970-01-01T00:00:59.123456789' as a 
> scalar of type timestamp[s]
> {code}
> I thought that nanoseconds should be used, but it should be noted that 
> POSIXct is currently supposed to be converted to microseconds, as shown in 
> ARROW-17424.





[jira] [Updated] (ARROW-17424) [R] Microsecond is not sufficient units for POSIXct

2022-08-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17424:
--
Summary: [R] Microsecond is not sufficient units for POSIXct  (was: [R] 
Microseconds are not sufficient units for POSIXct)

> [R] Microsecond is not sufficient units for POSIXct
> ---
>
> Key: ARROW-17424
> URL: https://issues.apache.org/jira/browse/ARROW-17424
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> I believe the {{POSIXct}} type of R currently corresponds to the Arrow 
> {{timestamp[us, tz=UTC]}} type.
> {code:r}
> lubridate::as_datetime(0) |> arrow::infer_type()
> #> Timestamp
> #> timestamp[us, tz=UTC]
> {code}
> However, as shown below, POSIXct may hold data finer than a microsecond.
> {code:r}
> lubridate::as_datetime(0.1) |> as.numeric()
> #> [1] 1e-09
> lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric()
> #> [1] 1.192093e-07
> {code}
> I don't know why it is currently set in microseconds, but is there any reason 
> not to set it in nanoseconds?





[jira] [Updated] (ARROW-17424) [R] Microsecond is not sufficient unit for POSIXct

2022-08-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17424:
--
Summary: [R] Microsecond is not sufficient unit for POSIXct  (was: [R] 
Microsecond is not sufficient units for POSIXct)

> [R] Microsecond is not sufficient unit for POSIXct
> --
>
> Key: ARROW-17424
> URL: https://issues.apache.org/jira/browse/ARROW-17424
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> I believe the {{POSIXct}} type of R currently corresponds to the Arrow 
> {{timestamp[us, tz=UTC]}} type.
> {code:r}
> lubridate::as_datetime(0) |> arrow::infer_type()
> #> Timestamp
> #> timestamp[us, tz=UTC]
> {code}
> However, as shown below, POSIXct may hold data finer than a microsecond.
> {code:r}
> lubridate::as_datetime(0.1) |> as.numeric()
> #> [1] 1e-09
> lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric()
> #> [1] 1.192093e-07
> {code}
> I don't know why it is currently set in microseconds, but is there any reason 
> not to set it in nanoseconds?





[jira] [Created] (ARROW-17425) [R] [R] lubridate::as_datetime() etc. in dplyr query should be able to handle time in sub seconds

2022-08-15 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17425:
-

 Summary: [R] [R] lubridate::as_datetime() etc. in dplyr query 
should be able to handle time in sub seconds
 Key: ARROW-17425
 URL: https://issues.apache.org/jira/browse/ARROW-17425
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya


Since the current unit is fixed to "s", an error will occur if a time 
containing sub-seconds is given.

{code:r}
"1970-01-01T00:00:59.123456789" |>
  data.frame(x = _) |>
  arrow::arrow_table() |>
  dplyr::mutate(x = lubridate::as_datetime(x, tz = "UTC")) |>
  dplyr::collect()
#> Error in `dplyr::collect()`:
#> ! Invalid: Failed to parse string: '1970-01-01T00:00:59.123456789' as a 
scalar of type timestamp[s]
{code}

I thought that nanoseconds should be used, but it should be noted that POSIXct 
is currently supposed to be converted to microseconds, as shown in ARROW-17424.





[jira] [Created] (ARROW-17424) [R] Microseconds are not sufficient units for POSIXct

2022-08-15 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17424:
-

 Summary: [R] Microseconds are not sufficient units for POSIXct
 Key: ARROW-17424
 URL: https://issues.apache.org/jira/browse/ARROW-17424
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya


I believe the {{POSIXct}} type of R currently corresponds to the Arrow 
{{timestamp[us, tz=UTC]}} type.

{code:r}
lubridate::as_datetime(0) |> arrow::infer_type()
#> Timestamp
#> timestamp[us, tz=UTC]
{code}

However, as shown below, POSIXct may hold data finer than a microsecond.

{code:r}
lubridate::as_datetime(0.1) |> as.numeric()
#> [1] 1e-09
lubridate::as_datetime("1970-01-01 00:00:00.001") |> as.numeric()
#> [1] 1.192093e-07
{code}

I don't know why it is currently set in microseconds, but is there any reason 
not to set it in nanoseconds?
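One piece of arithmetic relevant to the choice of unit (my observation, not from the ticket): a POSIXct is a float64 count of seconds since the epoch, and near current dates adjacent float64 values are roughly 238 ns apart, so microseconds discard detail a POSIXct can carry, while full nanosecond resolution is also beyond what the float64 representation guarantees.

```python
import math

# Spacing between adjacent float64 values near a 2022-era epoch value
# (~1.66e9 seconds): about 2.4e-7 s, i.e. roughly 238 nanoseconds.
step = math.ulp(1.66e9)
assert 2e-7 < step < 3e-7
```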





[jira] [Created] (ARROW-17417) [R] Implement typeof() in Arrow dplyr queries

2022-08-15 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17417:
-

 Summary: [R] Implement typeof() in Arrow dplyr queries
 Key: ARROW-17417
 URL: https://issues.apache.org/jira/browse/ARROW-17417
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya


Currently this is useless because it always returns the string {{environment}}; 
in dbplyr, however, backends such as duckdb have a {{typeof}} function, so the 
same code works there as is.


{code:r}
dplyr::starwars |>
  dplyr::transmute(x = typeof(films))
#> # A tibble: 87 × 1
#>    x
#>    <chr>
#>  1 list
#>  2 list
#>  3 list
#>  4 list
#>  5 list
#>  6 list
#>  7 list
#>  8 list
#>  9 list
#> 10 list
#> # … with 77 more rows
#> # ℹ Use `print(n = ...)` to see more rows

dplyr::starwars |>
  arrow::to_duckdb() |>
  dplyr::transmute(x = typeof(films)) |>
  dplyr::collect()
#> # A tibble: 87 × 1
#>    x
#>    <chr>
#>  1 VARCHAR[]
#>  2 VARCHAR[]
#>  3 VARCHAR[]
#>  4 VARCHAR[]
#>  5 VARCHAR[]
#>  6 VARCHAR[]
#>  7 VARCHAR[]
#>  8 VARCHAR[]
#>  9 VARCHAR[]
#> 10 VARCHAR[]
#> # … with 77 more rows
#> # ℹ Use `print(n = ...)` to see more rows

dplyr::starwars |>
  arrow::arrow_table() |>
  dplyr::transmute(x = typeof(films)) |>
  dplyr::collect()
#> # A tibble: 87 × 1
#>    x
#>    <chr>
#>  1 environment
#>  2 environment
#>  3 environment
#>  4 environment
#>  5 environment
#>  6 environment
#>  7 environment
#>  8 environment
#>  9 environment
#> 10 environment
#> # … with 77 more rows
#> # ℹ Use `print(n = ...)` to see more rows
{code}

I would expect it to work as follows.

{code:r}
dplyr::starwars |>
  arrow::arrow_table() |>
  dplyr::transmute(x = arrow::infer_type(films)$ToString()) |>
  dplyr::collect()
#> # A tibble: 87 × 1
#>    x
#>    <chr>
#>  1 list
#>  2 list
#>  3 list
#>  4 list
#>  5 list
#>  6 list
#>  7 list
#>  8 list
#>  9 list
#> 10 list
#> # … with 77 more rows
#> # ℹ Use `print(n = ...)` to see more rows
{code}





[jira] [Closed] (ARROW-17415) [R] Implement lubridate::with_tz and lubridate::force_tz

2022-08-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya closed ARROW-17415.
-
Resolution: Duplicate

Sorry, it looks like I submitted this twice because of a network error.

> [R] Implement lubridate::with_tz and lubridate::force_tz
> 
>
> Key: ARROW-17415
> URL: https://issues.apache.org/jira/browse/ARROW-17415
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>






[jira] [Created] (ARROW-17415) [R] Implement lubridate::with_tz and lubridate::force_tz

2022-08-15 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17415:
-

 Summary: [R] Implement lubridate::with_tz and lubridate::force_tz
 Key: ARROW-17415
 URL: https://issues.apache.org/jira/browse/ARROW-17415
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya








[jira] [Created] (ARROW-17416) [R] Implement lubridate::with_tz and lubridate::force_tz

2022-08-15 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17416:
-

 Summary: [R] Implement lubridate::with_tz and lubridate::force_tz
 Key: ARROW-17416
 URL: https://issues.apache.org/jira/browse/ARROW-17416
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya








[jira] [Created] (ARROW-17414) [R]: Lack of `assume_timezone` binding

2022-08-15 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17414:
-

 Summary: [R]: Lack of `assume_timezone` binding
 Key: ARROW-17414
 URL: https://issues.apache.org/jira/browse/ARROW-17414
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 9.0.0
Reporter: SHIMA Tatsuya


If we run the following code in R, we will get a C++ derived error message 
telling us to use {{assume_timezone}}.
However, this error message is not helpful because there is no binding for the 
{{assume_timezone}} function in R.

{code:r}
tf <- tempfile()
writeLines("2004-04-01 12:00", tf)
arrow::read_csv_arrow(tf, schema = arrow::schema(col1 = arrow::timestamp("s", 
"UTC")))
#> Error:
#> ! Invalid: In CSV column #0: CSV conversion error to timestamp[s, tz=UTC]: 
expected a zone offset in '2004-04-01 12:00'. If these timestamps are in local 
time, parse them as timestamps without timezone, then call assume_timezone.
#> ℹ If you have supplied a schema and your data contains a header row, you 
should supply the argument `skip = 1` to prevent the header being read in as 
data.
{code}

It would be useful to improve the error message or to allow {{assume_timezone}} 
to be used from R as well.
(although {{lubridate::with_tz()}} and {{lubridate::force_tz()}} could be more 
useful within a dplyr query)





[jira] [Commented] (ARROW-16318) [R]Timezone is not supported by to_duckdb()

2022-08-15 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579647#comment-17579647
 ] 

SHIMA Tatsuya commented on ARROW-16318:
---

I believe this is resolved by duckdb 0.4.0.

> [R]Timezone is not supported by to_duckdb()
> ---
>
> Key: ARROW-16318
> URL: https://issues.apache.org/jira/browse/ARROW-16318
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Zsolt Kegyes-Brassai
>Priority: Minor
>
> Here is a reproducible example:
>  
> {code:java}
> library(tidyverse)
> library(arrow)
> df1 <- tibble(time = lubridate::now(tzone = "UTC"))
> str(df1)
> #> tibble [1 x 1] (S3: tbl_df/tbl/data.frame)
> #>  $ time: POSIXct[1:1], format: "2022-04-25 12:50:10"
> write_dataset(df1, here::here("temp/df1"), format = "parquet")
> open_dataset(here::here("temp/df1")) |> 
>   to_duckdb()
> #> Error: duckdb_prepare_R: Failed to prepare query SELECT *
> #> FROM "arrow_001" AS "q01"
> #> WHERE (0 = 1)
> #> Error: Not implemented Error: Unsupported Internal Arrow Type tsu:UTC
> df2 <- tibble(time = lubridate::now())
> str(df2)
> #> tibble [1 x 1] (S3: tbl_df/tbl/data.frame)
> #>  $ time: POSIXct[1:1], format: "2022-04-25 14:50:11"
> write_dataset(df2, here::here("temp/df2"), format = "parquet")
> open_dataset(here::here("temp/df2")) |> 
>   to_duckdb()
> #> # Source:   table [?? x 1]
> #> # Database: duckdb_connection
> #>   time               
> #>   <dttm>
> #> 1 2022-04-25 12:50:11
> {code}
>  
> Timestamps without timezone information work fine.
> How can one easily remove the timezone information from a {{timestamp}} type 
> column in a parquet dataset?





[jira] [Updated] (ARROW-15602) [R][Docs] Update docs to explain how to read timestamp with timezone columns

2022-08-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-15602:
--
Summary: [R][Docs] Update docs to explain how to read timestamp with 
timezone columns  (was: [R] Update docs to explain how to read timestamp with 
timezone columns)

> [R][Docs] Update docs to explain how to read timestamp with timezone columns
> 
>
> Key: ARROW-15602
> URL: https://issues.apache.org/jira/browse/ARROW-15602
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: R version 4.1.2 (2021-11-01)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 20.04.3 LTS
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The following values in a csv file can be read as timestamp by 
> `pyarrow.csv.read_csv` and `readr::read_csv`, but not by 
> `arrow::read_csv_arrow`.
> {code}
> "x"
> "2004-04-01T12:00+09:00"
> {code}





[jira] [Assigned] (ARROW-15602) [R] Update docs to explain how to read timestamp with timezone columns

2022-08-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya reassigned ARROW-15602:
-

Assignee: SHIMA Tatsuya

> [R] Update docs to explain how to read timestamp with timezone columns
> --
>
> Key: ARROW-15602
> URL: https://issues.apache.org/jira/browse/ARROW-15602
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: R version 4.1.2 (2021-11-01)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 20.04.3 LTS
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>
> The following values in a csv file can be read as timestamp by 
> `pyarrow.csv.read_csv` and `readr::read_csv`, but not by 
> `arrow::read_csv_arrow`.
> {code}
> "x"
> "2004-04-01T12:00+09:00"
> {code}





[jira] [Updated] (ARROW-15602) [R] Update docs to explain how to read timestamp with timezone columns

2022-08-14 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-15602:
--
Summary: [R] Update docs to explain how to read timestamp with timezone 
columns  (was: [R] can't read timestamp with timezone from CSV (or other 
delimited) file without options)

> [R] Update docs to explain how to read timestamp with timezone columns
> --
>
> Key: ARROW-15602
> URL: https://issues.apache.org/jira/browse/ARROW-15602
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: R version 4.1.2 (2021-11-01)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 20.04.3 LTS
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> The following values in a csv file can be read as timestamp by 
> `pyarrow.csv.read_csv` and `readr::read_csv`, but not by 
> `arrow::read_csv_arrow`.
> {code}
> "x"
> "2004-04-01T12:00+09:00"
> {code}





[jira] [Commented] (ARROW-15602) [R] can't read timestamp with timezone from CSV (or other delimited) file without options

2022-08-14 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579539#comment-17579539
 ] 

SHIMA Tatsuya commented on ARROW-15602:
---

I may have misunderstood something. So far the following example seems to work. 
An update to the documentation may be sufficient.

{code:r}
tf <- tempfile()
writeLines("x\n2004-04-01T12:00+09:00", tf)
arrow::read_csv_arrow(tf)
#> # A tibble: 1 × 1
#>   x
#>   <dttm>
#> 1 2004-04-01 03:00:00
{code}
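The same conversion can be checked with the Python standard library: the value carries an explicit +09:00 offset, and normalizing it to UTC gives the 03:00 instant shown above.

```python
from datetime import datetime, timezone

# Parse the ISO 8601 value with its zone offset, then normalize to UTC.
ts = datetime.fromisoformat("2004-04-01T12:00+09:00")
assert ts.utcoffset().total_seconds() == 9 * 3600
assert ts.astimezone(timezone.utc).isoformat() == "2004-04-01T03:00:00+00:00"
```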

> [R] can't read timestamp with timezone from CSV (or other delimited) file 
> without options
> -
>
> Key: ARROW-15602
> URL: https://issues.apache.org/jira/browse/ARROW-15602
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: R version 4.1.2 (2021-11-01)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 20.04.3 LTS
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> The following values in a csv file can be read as timestamp by 
> `pyarrow.csv.read_csv` and `readr::read_csv`, but not by 
> `arrow::read_csv_arrow`.
> {code}
> "x"
> "2004-04-01T12:00+09:00"
> {code}





[jira] [Updated] (ARROW-15602) [R] can't read timestamp with timezone from CSV (or other delimited) file without options

2022-08-14 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-15602:
--
Summary: [R] can't read timestamp with timezone from CSV (or other 
delimited) file without options  (was: [R] Update docs to explain how to 
specify timezone in CSV parsing)

> [R] can't read timestamp with timezone from CSV (or other delimited) file 
> without options
> -
>
> Key: ARROW-15602
> URL: https://issues.apache.org/jira/browse/ARROW-15602
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: R version 4.1.2 (2021-11-01)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 20.04.3 LTS
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> The following values in a csv file can be read as timestamp by 
> `pyarrow.csv.read_csv` and `readr::read_csv`, but not by 
> `arrow::read_csv_arrow`.
> {code}
> "x"
> "2004-04-01T12:00+09:00"
> {code}





[jira] [Assigned] (ARROW-17092) [Docs] Add note about "Feather" to the IPC file format document

2022-07-24 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya reassigned ARROW-17092:
-

Assignee: SHIMA Tatsuya

> [Docs] Add note about "Feather" to the IPC file format document
> ---
>
> Key: ARROW-17092
> URL: https://issues.apache.org/jira/browse/ARROW-17092
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 8.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The IPC file format is often referred to as Feather (especially in relation 
> to Python and R), but beginners are confused because the word "Feather" does 
> not appear in the IPC file format documentation.
> https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> Note: This ticket was created as a result of a conversation with [~kou] on 
> Twitter.
> https://twitter.com/eitsupi/status/1547534742324920321





[jira] [Assigned] (ARROW-17089) [Python] Use `.arrow` as extension for IPC file dataset

2022-07-21 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya reassigned ARROW-17089:
-

Assignee: SHIMA Tatsuya

> [Python] Use `.arrow` as extension for IPC file dataset
> ---
>
> Key: ARROW-17089
> URL: https://issues.apache.org/jira/browse/ARROW-17089
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>
> Same as ARROW-17088
> As noted in the following document, the recommended extension for IPC files 
> is now `.arrow`.
> > We recommend the “.arrow” extension for files created with this format.
> https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> However, currently when writing a dataset with the 
> {{pyarrow.dataset.write_dataset}} function, the default extension is 
> {{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected.
> https://github.com/apache/arrow/blob/b8067151db9bfc04860285fdd8b5e73703346037/python/pyarrow/_dataset.pyx#L1149-L1151





[jira] [Updated] (ARROW-17143) [R] Add examples working with `tidyr::unnest` and `tidyr::unnest_longer`

2022-07-20 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17143:
--
Description: 
Related to ARROW-8813 ARROW-12099

The arrow package can convert JSON files to data frames very easily, but 
{{tidyr::unnest_longer}} is needed to expand arrays.
I wonder if {{tidyr}} could be added as a suggested package and examples like 
this included in the documentation and test cases.

{code:r}
tf <- tempfile()
on.exit(unlink(tf))
writeLines('
{ "hello": 3.5, "world": false, "foo": { "bar": [ 1, 2 ] } }
{ "hello": 3.25, "world": null }
{ "hello": 0.0, "world": true, "foo": { "bar": [ 3, 4, 5 ] } }
  ', tf)

arrow::read_json_arrow(tf) |>
  tidyr::unnest(foo, names_sep = ".") |>
  tidyr::unnest_longer(foo.bar)
#> # A tibble: 6 × 3
#>   hello world foo.bar
#>   <dbl> <lgl>   <dbl>
#> 1  3.5  FALSE   1
#> 2  3.5  FALSE   2
#> 3  3.25 NA NA
#> 4  0TRUE3
#> 5  0TRUE4
#> 6  0TRUE5
{code}

  was:
Related to ARROW-8813

The arrow package can convert json files to data frames very easily, but 
{{tidyr::unnest_longer}} is needed for array expansion.
Wonder if {{tidyr}} could be added to the recommended package and examples like 
this could be included in the documentation and test cases.

{code:r}
tf <- tempfile()
on.exit(unlink(tf))
writeLines('
{ "hello": 3.5, "world": false, "foo": { "bar": [ 1, 2 ] } }
{ "hello": 3.25, "world": null }
{ "hello": 0.0, "world": true, "foo": { "bar": [ 3, 4, 5 ] } }
  ', tf)

arrow::read_json_arrow(tf) |>
  tidyr::unnest(foo, names_sep = ".") |>
  tidyr::unnest_longer(foo.bar)
#> # A tibble: 6 × 3
#>   hello world foo.bar
#>   <dbl> <lgl>   <dbl>
#> 1  3.5  FALSE   1
#> 2  3.5  FALSE   2
#> 3  3.25 NA NA
#> 4  0TRUE3
#> 5  0TRUE4
#> 6  0TRUE5
{code}


> [R] Add examples working with `tidyr::unnest` and `tidyr::unnest_longer`
> ---
>
> Key: ARROW-17143
> URL: https://issues.apache.org/jira/browse/ARROW-17143
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 8.0.1
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Related to ARROW-8813 ARROW-12099
> The arrow package can convert JSON files to data frames very easily, but 
> {{tidyr::unnest_longer}} is needed to expand arrays.
> I wonder if {{tidyr}} could be added as a suggested package and examples 
> like this included in the documentation and test cases.
> {code:r}
> tf <- tempfile()
> on.exit(unlink(tf))
> writeLines('
> { "hello": 3.5, "world": false, "foo": { "bar": [ 1, 2 ] } }
> { "hello": 3.25, "world": null }
> { "hello": 0.0, "world": true, "foo": { "bar": [ 3, 4, 5 ] } }
>   ', tf)
> arrow::read_json_arrow(tf) |>
>   tidyr::unnest(foo, names_sep = ".") |>
>   tidyr::unnest_longer(foo.bar)
> #> # A tibble: 6 × 3
> #>   hello world foo.bar
> #>   <dbl> <lgl>   <dbl>
> #> 1  3.5  FALSE   1
> #> 2  3.5  FALSE   2
> #> 3  3.25 NA NA
> #> 4  0TRUE3
> #> 5  0TRUE4
> #> 6  0TRUE5
> {code}





[jira] [Created] (ARROW-17143) [R] Add examples working with `tidyr::unnest` and `tidyr::unnest_longer`

2022-07-20 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17143:
-

 Summary: [R] Add examples working with `tidyr::unnest` and 
`tidyr::unnest_longer`
 Key: ARROW-17143
 URL: https://issues.apache.org/jira/browse/ARROW-17143
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, R
Affects Versions: 8.0.1
Reporter: SHIMA Tatsuya


Related to ARROW-8813

The arrow package can convert JSON files to data frames very easily, but 
{{tidyr::unnest_longer}} is needed to expand arrays.
I wonder if {{tidyr}} could be added as a suggested package and examples like 
this included in the documentation and test cases.

{code:r}
tf <- tempfile()
on.exit(unlink(tf))
writeLines('
{ "hello": 3.5, "world": false, "foo": { "bar": [ 1, 2 ] } }
{ "hello": 3.25, "world": null }
{ "hello": 0.0, "world": true, "foo": { "bar": [ 3, 4, 5 ] } }
  ', tf)

arrow::read_json_arrow(tf) |>
  tidyr::unnest(foo, names_sep = ".") |>
  tidyr::unnest_longer(foo.bar)
#> # A tibble: 6 × 3
#>   hello world foo.bar
#>   
#> 1  3.5  FALSE   1
#> 2  3.5  FALSE   2
#> 3  3.25 NA NA
#> 4  0TRUE3
#> 5  0TRUE4
#> 6  0TRUE5
{code}





[jira] [Assigned] (ARROW-8324) [R] Add read/write_ipc_file separate from _feather

2022-07-16 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya reassigned ARROW-8324:


Assignee: SHIMA Tatsuya

> [R] Add read/write_ipc_file separate from _feather
> --
>
> Key: ARROW-8324
> URL: https://issues.apache.org/jira/browse/ARROW-8324
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See [https://github.com/apache/arrow/pull/6771#issuecomment-608133760]
> {quote}Let's add read/write_ipc_file also? I'm wary of the "version" option 
> in "write_feather" and the Feather version inference capability in 
> "read_feather". It's potentially confusing and we may choose to add options 
> to write_ipc_file/read_ipc_file that are more developer centric, having to do 
> with particulars in the IPC format, that are not relevant or appropriate for 
> the Feather APIs.
> IMHO it's best for "Feather format" to remain an abstracted higher-level 
> concept with its use of the "IPC file format" as an implementation detail, 
> and segregated from the other things.
> {quote}





[jira] [Commented] (ARROW-8324) [R] Add read/write_ipc_file separate from _feather

2022-07-16 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567451#comment-17567451
 ] 

SHIMA Tatsuya commented on ARROW-8324:
--

Yes, I looked at this.

Since the C++ library now manages the Feather format version, it seems easier 
for now to make {{write_ipc_file()}} a special case of {{write_feather()}} and 
{{read_ipc_file()}} simply an alias for {{read_feather()}}.
In the future, these functions can be updated as the C++ library evolves.

> [R] Add read/write_ipc_file separate from _feather
> --
>
> Key: ARROW-8324
> URL: https://issues.apache.org/jira/browse/ARROW-8324
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See [https://github.com/apache/arrow/pull/6771#issuecomment-608133760]
> {quote}Let's add read/write_ipc_file also? I'm wary of the "version" option 
> in "write_feather" and the Feather version inference capability in 
> "read_feather". It's potentially confusing and we may choose to add options 
> to write_ipc_file/read_ipc_file that are more developer centric, having to do 
> with particulars in the IPC format, that are not relevant or appropriate for 
> the Feather APIs.
> IMHO it's best for "Feather format" to remain an abstracted higher-level 
> concept with its use of the "IPC file format" as an implementation detail, 
> and segregated from the other things.
> {quote}





[jira] [Updated] (ARROW-17089) [Python] Use `.arrow` as extension for IPC file dataset

2022-07-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17089:
--
Description: 
Same as ARROW-17088

As noted in the following document, the recommended extension for IPC files is 
now `.arrow`.

> We recommend the “.arrow” extension for files created with this format.
https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format

However, currently when writing a dataset with the 
{{pyarrow.dataset.write_dataset}} function, the default extension is 
{{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected.
https://github.com/apache/arrow/blob/b8067151db9bfc04860285fdd8b5e73703346037/python/pyarrow/_dataset.pyx#L1149-L1151

  was:
Same as ARROW-17088

As noted in the following document, the recommended extension for IPC files is 
now `.arrow`.

> We recommend the “.arrow” extension for files created with this format.
https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format

However, currently when writing a dataset with the 
{{pyarrow.dataset.write_dataset}} function, the default extension is 
{{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected.

https://github.com/apache/arrow/blob/b8067151db9bfc04860285fdd8b5e73703346037/python/pyarrow/_dataset.pyx#L1149-L1151


> [Python] Use `.arrow` as extension for IPC file dataset
> ---
>
> Key: ARROW-17089
> URL: https://issues.apache.org/jira/browse/ARROW-17089
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Same as ARROW-17088
> As noted in the following document, the recommended extension for IPC files 
> is now `.arrow`.
> > We recommend the “.arrow” extension for files created with this format.
> https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> However, currently when writing a dataset with the 
> {{pyarrow.dataset.write_dataset}} function, the default extension is 
> {{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected.
> https://github.com/apache/arrow/blob/b8067151db9bfc04860285fdd8b5e73703346037/python/pyarrow/_dataset.pyx#L1149-L1151





[jira] [Updated] (ARROW-17089) [Python] Use `.arrow` as extension for IPC file dataset

2022-07-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17089:
--
Description: 
Same as ARROW-17088

As noted in the following document, the recommended extension for IPC files is 
now `.arrow`.

> We recommend the “.arrow” extension for files created with this format.
https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format

However, currently when writing a dataset with the 
{{pyarrow.dataset.write_dataset}} function, the default extension is 
{{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected.

https://github.com/apache/arrow/blob/b8067151db9bfc04860285fdd8b5e73703346037/python/pyarrow/_dataset.pyx#L1149-L1151

  was:
Same as ARROW-17088

As noted in the following document, the recommended extension for IPC files is 
now `.arrow`.

> We recommend the “.arrow” extension for files created with this format.
https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format

However, currently when writing a dataset with the 
{{pyarrow.dataset.write_dataset}} function, the default extension is 
{{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected.


> [Python] Use `.arrow` as extension for IPC file dataset
> ---
>
> Key: ARROW-17089
> URL: https://issues.apache.org/jira/browse/ARROW-17089
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> Same as ARROW-17088
> As noted in the following document, the recommended extension for IPC files 
> is now `.arrow`.
> > We recommend the “.arrow” extension for files created with this format.
> https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> However, currently when writing a dataset with the 
> {{pyarrow.dataset.write_dataset}} function, the default extension is 
> {{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected.
> https://github.com/apache/arrow/blob/b8067151db9bfc04860285fdd8b5e73703346037/python/pyarrow/_dataset.pyx#L1149-L1151





[jira] [Created] (ARROW-17092) [Docs] Add note about "Feather" to the IPC file format document

2022-07-15 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17092:
-

 Summary: [Docs] Add note about "Feather" to the IPC file format 
document
 Key: ARROW-17092
 URL: https://issues.apache.org/jira/browse/ARROW-17092
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 8.0.0
Reporter: SHIMA Tatsuya


The IPC file format is often referred to as Feather (especially in relation to 
Python and R), but beginners are confused because the word "Feather" does not 
appear in the IPC file format documentation.
https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format

Note: This ticket was created as a result of a conversation with [~kou] on 
Twitter.
https://twitter.com/eitsupi/status/1547534742324920321





[jira] [Created] (ARROW-17089) [Python] Use `.arrow` as extension for IPC file dataset

2022-07-15 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17089:
-

 Summary: [Python] Use `.arrow` as extension for IPC file dataset
 Key: ARROW-17089
 URL: https://issues.apache.org/jira/browse/ARROW-17089
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 8.0.0
Reporter: SHIMA Tatsuya


Same as ARROW-17088

As noted in the following document, the recommended extension for IPC files is 
now `.arrow`.

> We recommend the “.arrow” extension for files created with this format.
https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format

However, currently when writing a dataset with the 
{{pyarrow.dataset.write_dataset}} function, the default extension is 
{{.feather}} when {{arrow}} or {{ipc}} or {{feather}} is selected.
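Until the default changes, a workaround is to set the file name pattern 
explicitly through {{write_dataset}}'s {{basename_template}} parameter (a 
minimal sketch; the template must contain the literal "\{i\}" placeholder, 
which is replaced by the fragment index):

{code:python}
import pathlib
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"a": [1, 2, 3]})
out = pathlib.Path(tempfile.mkdtemp())

# Override the default ".feather" basename with an ".arrow" one
ds.write_dataset(table, out, format="ipc",
                 basename_template="part-{i}.arrow")

print(sorted(p.name for p in out.iterdir()))
{code}

R's {{write_dataset}} accepts a similar {{basename_template}} argument, so the 
same workaround should apply there.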





[jira] [Created] (ARROW-17088) [R] Use `.arrow` as extension of IPC files of datasets

2022-07-15 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17088:
-

 Summary: [R] Use `.arrow` as extension of IPC files of datasets
 Key: ARROW-17088
 URL: https://issues.apache.org/jira/browse/ARROW-17088
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 8.0.0
Reporter: SHIMA Tatsuya


Related to ARROW-17072

As noted in the following document, the recommended extension for IPC files is 
now `.arrow`.

> We recommend the “.arrow” extension for files created with this format.
https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format

However, currently when writing a dataset with the {{write_dataset}} function, 
the default extension is {{.feather}} when {{feather}} is selected as the 
format, and {{.ipc}} when {{ipc}} is selected.

https://github.com/apache/arrow/blob/f295da4cfdcf102d9ac2d16bbca6f8342fc3e6a8/r/R/dataset-write.R#L124-L126





[jira] [Assigned] (ARROW-17085) [R] group_vars() should not return NULL

2022-07-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya reassigned ARROW-17085:
-

Assignee: SHIMA Tatsuya

> [R] group_vars() should not return NULL
> ---
>
> Key: ARROW-17085
> URL: https://issues.apache.org/jira/browse/ARROW-17085
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: SHIMA Tatsuya
>Assignee: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {code:r}
> mtcars |> dplyr::group_vars()
> #> character(0)
> mtcars |> arrow:::as_adq() |> dplyr::group_vars()
> #> character(0)
> mtcars |> arrow::arrow_table() |> dplyr::group_vars()
> #> NULL
> {code}
> {{dplyr::group_vars()}} is not expected to return NULL, so the following 
> code results in an error.
> {code:r}
> mtcars |> arrow::arrow_table() |> dtplyr::lazy_dt()
> #> Error in new_step(parent, vars = names(parent), groups = groups, locals = 
> list(), : is.character(groups) is not TRUE
> {code}





[jira] [Updated] (ARROW-17085) [R] group_vars() should not return NULL

2022-07-15 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-17085:
--
Summary: [R] group_vars() should not return NULL  (was: [R] 
group_vars() returns NULL)

> [R] group_vars() should not return NULL
> ---
>
> Key: ARROW-17085
> URL: https://issues.apache.org/jira/browse/ARROW-17085
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> {code:r}
> mtcars |> dplyr::group_vars()
> #> character(0)
> mtcars |> arrow:::as_adq() |> dplyr::group_vars()
> #> character(0)
> mtcars |> arrow::arrow_table() |> dplyr::group_vars()
> #> NULL
> {code}
> {{dplyr::group_vars()}} is not expected to return NULL, so the following 
> code results in an error.
> {code:r}
> mtcars |> arrow::arrow_table() |> dtplyr::lazy_dt()
> #> Error in new_step(parent, vars = names(parent), groups = groups, locals = 
> list(), : is.character(groups) is not TRUE
> {code}





[jira] [Commented] (ARROW-17085) [R] group_vars() returns NULL

2022-07-15 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567235#comment-17567235
 ] 

SHIMA Tatsuya commented on ARROW-17085:
---

> You may also want to change groups() to return an empty list() instead of 
> NULL.

I was not aware of this. I will take a look at this too.

> [R] group_vars() returns NULL
> ---
>
> Key: ARROW-17085
> URL: https://issues.apache.org/jira/browse/ARROW-17085
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> {code:r}
> mtcars |> dplyr::group_vars()
> #> character(0)
> mtcars |> arrow:::as_adq() |> dplyr::group_vars()
> #> character(0)
> mtcars |> arrow::arrow_table() |> dplyr::group_vars()
> #> NULL
> {code}
> {{dplyr::group_vars()}} is not expected to return NULL, so the following 
> code results in an error.
> {code:r}
> mtcars |> arrow::arrow_table() |> dtplyr::lazy_dt()
> #> Error in new_step(parent, vars = names(parent), groups = groups, locals = 
> list(), : is.character(groups) is not TRUE
> {code}





[jira] [Commented] (ARROW-17085) [R] group_vars() returns NULL

2022-07-15 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567233#comment-17567233
 ] 

SHIMA Tatsuya commented on ARROW-17085:
---

Yes, I was already working on that. I will send a PR shortly.

> [R] group_vars() returns NULL
> ---
>
> Key: ARROW-17085
> URL: https://issues.apache.org/jira/browse/ARROW-17085
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> {code:r}
> mtcars |> dplyr::group_vars()
> #> character(0)
> mtcars |> arrow:::as_adq() |> dplyr::group_vars()
> #> character(0)
> mtcars |> arrow::arrow_table() |> dplyr::group_vars()
> #> NULL
> {code}
> {{dplyr::group_vars()}} is not expected to return NULL, so the following 
> code results in an error.
> {code:r}
> mtcars |> arrow::arrow_table() |> dtplyr::lazy_dt()
> #> Error in new_step(parent, vars = names(parent), groups = groups, locals = 
> list(), : is.character(groups) is not TRUE
> {code}





[jira] [Created] (ARROW-17085) [R] group_vars() returns NULL

2022-07-15 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17085:
-

 Summary: [R] group_vars() returns NULL
 Key: ARROW-17085
 URL: https://issues.apache.org/jira/browse/ARROW-17085
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 8.0.0
Reporter: SHIMA Tatsuya


{code:r}
mtcars |> dplyr::group_vars()
#> character(0)
mtcars |> arrow:::as_adq() |> dplyr::group_vars()
#> character(0)
mtcars |> arrow::arrow_table() |> dplyr::group_vars()
#> NULL
{code}

{{dplyr::group_vars()}} is not expected to return NULL, so the following code 
results in an error.

{code:r}
mtcars |> arrow::arrow_table() |> dtplyr::lazy_dt()
#> Error in new_step(parent, vars = names(parent), groups = groups, locals = 
list(), : is.character(groups) is not TRUE
{code}





[jira] [Commented] (ARROW-17072) [R] Rename *_feather functions

2022-07-15 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567175#comment-17567175
 ] 

SHIMA Tatsuya commented on ARROW-17072:
---

Thank you both and I agree to close this in favor of ARROW-8324.
Perhaps after adding *_ipc_file functions, the various documents need to be 
updated to recommend the use of *_ipc_file functions.

> [R] Rename *_feather functions
> --
>
> Key: ARROW-17072
> URL: https://issues.apache.org/jira/browse/ARROW-17072
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> As noted in the following document, the recommended extension for IPC files 
> is now `.arrow`.
> > We recommend the “.arrow” extension for files created with this format.
> https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> However, since ARROW-16268 the R library cannot read IPC files without using 
> the `read_feather` function.
> I think users will be confused if this function name is kept, because 
> beginners do not associate the word `feather` with `.arrow`.
> For example, could we deprecate function `read_feather` and recommend another 
> function like `read_ipc_file`, which has the same functionality?
> Note: This ticket was created as a result of a conversation with [~kou] on 
> Twitter.
> https://twitter.com/ktou/status/1547373388687376386





[jira] [Comment Edited] (ARROW-17072) [R] Rename *_feather functions

2022-07-15 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567175#comment-17567175
 ] 

SHIMA Tatsuya edited comment on ARROW-17072 at 7/15/22 10:05 AM:
-

Thank you both and I agree to close this in favor of ARROW-8324.
Perhaps after adding *_ipc_file functions, the various documents need to be 
updated to recommend the use of *_ipc_file functions.


was (Author: JIRAUSER280211):
Thank you both and I agree to close this in favor of ARROW-8324.
Perhaps after adding *_ipc_file functions, the various documents need to be 
updated to recommend the use of _ipc_file functions.

> [R] Rename *_feather functions
> --
>
> Key: ARROW-17072
> URL: https://issues.apache.org/jira/browse/ARROW-17072
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> As noted in the following document, the recommended extension for IPC files 
> is now `.arrow`.
> > We recommend the “.arrow” extension for files created with this format.
> https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> However, since ARROW-16268 the R library cannot read IPC files without using 
> the `read_feather` function.
> I think users will be confused if this function name is kept, because 
> beginners do not associate the word `feather` with `.arrow`.
> For example, could we deprecate function `read_feather` and recommend another 
> function like `read_ipc_file`, which has the same functionality?
> Note: This ticket was created as a result of a conversation with [~kou] on 
> Twitter.
> https://twitter.com/ktou/status/1547373388687376386





[jira] [Created] (ARROW-17072) [R] Rename *_feather functions

2022-07-14 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-17072:
-

 Summary: [R] Rename *_feather functions
 Key: ARROW-17072
 URL: https://issues.apache.org/jira/browse/ARROW-17072
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 8.0.0
Reporter: SHIMA Tatsuya


As noted in the following document, the recommended extension for IPC files is 
now `.arrow`.

> We recommend the “.arrow” extension for files created with this format.
https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format

However, since ARROW-16268 the R library cannot read IPC files without using 
the `read_feather` function.
I think users will be confused if this function name is kept, because beginners 
do not associate the word `feather` with `.arrow`.

For example, could we deprecate function `read_feather` and recommend another 
function like `read_ipc_file`, which has the same functionality?

Note: This ticket was created as a result of a conversation with [~kou] on 
Twitter.
https://twitter.com/ktou/status/1547373388687376386





[jira] [Closed] (ARROW-15816) [R][Docs] pkgdown config refactoring

2022-04-19 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya closed ARROW-15816.
-
Resolution: Won't Fix

> [R][Docs] pkgdown config refactoring
> 
>
> Key: ARROW-15816
> URL: https://issues.apache.org/jira/browse/ARROW-15816
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Part of ARROW-15734
> Need to change the configuration of the pkgdown site which is not compatible 
> with bootstrap5.





[jira] [Created] (ARROW-16038) [R] different behavior from dplyr when mutate's `.keep` option is set

2022-03-27 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-16038:
-

 Summary: [R] different behavior from dplyr when mutate's `.keep` 
option is set
 Key: ARROW-16038
 URL: https://issues.apache.org/jira/browse/ARROW-16038
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 7.0.0
Reporter: SHIMA Tatsuya


The column order produced when `dplyr::mutate`'s `.keep` option is set (e.g. to 
"none") changed in dplyr 1.0.8 and now differs from the current behavior of the 
arrow package.

For more information, please see the following issues.

https://github.com/tidyverse/dplyr/pull/6035
https://github.com/tidyverse/dplyr/issues/6086
https://github.com/tidyverse/dplyr/pull/6087

{code:r}
library(dplyr, warn.conflicts = FALSE)

df <- tibble::tibble(x = 1:3, y = 4:6)

df |>
  transmute(x, z = x + 1, y)
#> # A tibble: 3 × 3
#>   x z y
#>   <int> <dbl> <int>
#> 1 1 2 4
#> 2 2 3 5
#> 3 3 4 6

df |>
  mutate(x, z = x + 1, y, .keep = "none")
#> # A tibble: 3 × 3
#>   x y z
#>   <int> <int> <dbl>
#> 1 1 4 2
#> 2 2 5 3
#> 3 3 6 4

df |>
  arrow::arrow_table() |>
  mutate(x, z = x + 1, y, .keep = "none") |>
  collect()
#> # A tibble: 3 × 3
#>   x z y
#>   <int> <dbl> <int>
#> 1 1 2 4
#> 2 2 3 5
#> 3 3 4 6
{code}





[jira] [Commented] (ARROW-15602) [R] can't read timestamp with timezone from CSV (or other delimited) file

2022-03-08 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502846#comment-17502846
 ] 

SHIMA Tatsuya commented on ARROW-15602:
---

I believe this format is defined in ISO 8601.
Shouldn't libarrow's ISO 8601 parser be able to handle it?
In fact, pyarrow already handles it. (Please see my comments above.)
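
A quick check of the pyarrow side (a sketch using the same value as the 
report; the exact inferred timestamp unit and timezone may vary by pyarrow 
version):

{code:python}
import io

import pyarrow as pa
import pyarrow.csv as pv

# The same CSV content as the report: an ISO 8601 timestamp with a UTC offset
data = b'"x"\n"2004-04-01T12:00+09:00"\n'
table = pv.read_csv(io.BytesIO(data))

# pyarrow's type inference parses the value as a timestamp column
print(table.column("x").type)
{code}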

> [R] can't read timestamp with timezone from CSV (or other delimited) file
> -
>
> Key: ARROW-15602
> URL: https://issues.apache.org/jira/browse/ARROW-15602
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: R version 4.1.2 (2021-11-01)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 20.04.3 LTS
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> The following values in a csv file can be read as timestamp by 
> `pyarrow.csv.read_csv` and `readr::read_csv`, but not by 
> `arrow::read_csv_arrow`.
> {code}
> "x"
> "2004-04-01T12:00+09:00"
> {code}





[jira] [Commented] (ARROW-15828) [Python][R] ChunkedArray's cast() method combine multiple arrays into one

2022-03-07 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502168#comment-17502168
 ] 

SHIMA Tatsuya commented on ARROW-15828:
---

Thanks for the reply.
My question was whether it is normal for the chunks to behave differently 
depending on the cast destination type: in some cases they remain split into 
chunks, and in others they become one contiguous chunk.
Does your guess imply that the chunks are combined only when the destination 
type is numeric?

I simply came across this behavior while experimenting, so please feel free to 
close this if it is the intended behavior.

> [Python][R] ChunkedArray's cast() method combine multiple arrays into one
> -
>
> Key: ARROW-15828
> URL: https://issues.apache.org/jira/browse/ARROW-15828
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python, R
>Affects Versions: 7.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> It appears that if I cast to int or float, the chunks are combined into a single array.
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> #> See arrow_info() for available features
> chunked_array(1:2, 3:4, 5:6)$cast(string())
> #> ChunkedArray
> #> [
> #>   [
> #> "1",
> #> "2"
> #>   ],
> #>   [
> #> "3",
> #> "4"
> #>   ],
> #>   [
> #> "5",
> #> "6"
> #>   ]
> #> ]
> chunked_array(1:2, 3:4, 5:6)$cast(float64())
> #> ChunkedArray
> #> [
> #>   [
> #> 1,
> #> 2,
> #> 3,
> #> 4,
> #> 5,
> #> 6
> #>   ]
> #> ]
> chunked_array(1:2, 3:4, 5:6)$cast(int64())
> #> ChunkedArray
> #> [
> #>   [
> #> 1,
> #> 2,
> #> 3,
> #> 4,
> #> 5,
> #> 6
> #>   ]
> #> ]
> chunked_array(1:2, 3:4, 5:6)$cast(date32())
> #> ChunkedArray
> #> [
> #>   [
> #> 1970-01-02,
> #> 1970-01-03
> #>   ],
> #>   [
> #> 1970-01-04,
> #> 1970-01-05
> #>   ],
> #>   [
> #> 1970-01-06,
> #> 1970-01-07
> #>   ]
> #> ]
> {code}





[jira] [Updated] (ARROW-15828) [Python][R] ChunkedArray's cast() method combine multiple arrays into one

2022-03-03 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-15828:
--
Summary: [Python][R] ChunkedArray's cast() method combine multiple arrays 
into one  (was: [R] ChunkedArray$cast() combine multiple arrays into one)

> [Python][R] ChunkedArray's cast() method combine multiple arrays into one
> -
>
> Key: ARROW-15828
> URL: https://issues.apache.org/jira/browse/ARROW-15828
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python, R
>Affects Versions: 7.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> It appears that if I cast to int or float, the chunks are combined into a single array.





[jira] [Updated] (ARROW-15828) [R] ChunkedArray$cast() combine multiple arrays into one

2022-03-03 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-15828:
--
Component/s: Python

> [R] ChunkedArray$cast() combine multiple arrays into one
> 
>
> Key: ARROW-15828
> URL: https://issues.apache.org/jira/browse/ARROW-15828
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python, R
>Affects Versions: 7.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> It appears that if I cast to int or float, the chunks are combined into a single array.





[jira] [Commented] (ARROW-15828) [R] ChunkedArray$cast() combine multiple arrays into one

2022-03-03 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500850#comment-17500850
 ] 

SHIMA Tatsuya commented on ARROW-15828:
---

I noticed that this is reproduced in Python as well.
Is this the intended behavior?

{code:python}
>>> import pyarrow as pa
>>> pa.chunked_array([pa.array([1,2]),pa.array([3,4])]).cast(pa.float64())

[
  [
1,
2,
3,
4
  ]
]
>>> pa.chunked_array([pa.array([1,2]),pa.array([3,4])]).cast(pa.utf8())

[
  [
"1",
"2"
  ],
  [
"3",
"4"
  ]
]
{code}
> [R] ChunkedArray$cast() combine multiple arrays into one
> 
>
> Key: ARROW-15828
> URL: https://issues.apache.org/jira/browse/ARROW-15828
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 7.0.0
>Reporter: SHIMA Tatsuya
>Priority: Major
>
> It appears that if I cast to int or float, the chunks are combined into a single array.





[jira] [Commented] (ARROW-15814) [R] Improve documentation for cast()

2022-03-02 Thread SHIMA Tatsuya (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500224#comment-17500224
 ] 

SHIMA Tatsuya commented on ARROW-15814:
---

[~dragosmg] Thank you for giving me the opportunity. I have created a minimal 
PR for now.
https://github.com/apache/arrow/pull/12546

It may also be worth touching on this in the dplyr article, as suggested in 
ARROW-14703.

> [R] Improve documentation for cast()
> 
>
> Key: ARROW-15814
> URL: https://issues.apache.org/jira/browse/ARROW-15814
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: good-first-issue, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Originated in the 
> [comments|https://issues.apache.org/jira/browse/ARROW-14820?focusedCommentId=17498465&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17498465]
>  for ARROW-14820.





[jira] [Created] (ARROW-15828) [R] ChunkedArray$cast() combine multiple arrays into one

2022-03-02 Thread SHIMA Tatsuya (Jira)
SHIMA Tatsuya created ARROW-15828:
-

 Summary: [R] ChunkedArray$cast() combine multiple arrays into one
 Key: ARROW-15828
 URL: https://issues.apache.org/jira/browse/ARROW-15828
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 7.0.0
Reporter: SHIMA Tatsuya


It appears that if I cast to int or float, the chunks are combined into a single array.

{code:r}
library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features
chunked_array(1:2, 3:4, 5:6)$cast(string())
#> ChunkedArray
#> [
#>   [
#> "1",
#> "2"
#>   ],
#>   [
#> "3",
#> "4"
#>   ],
#>   [
#> "5",
#> "6"
#>   ]
#> ]
chunked_array(1:2, 3:4, 5:6)$cast(float64())
#> ChunkedArray
#> [
#>   [
#> 1,
#> 2,
#> 3,
#> 4,
#> 5,
#> 6
#>   ]
#> ]
chunked_array(1:2, 3:4, 5:6)$cast(int64())
#> ChunkedArray
#> [
#>   [
#> 1,
#> 2,
#> 3,
#> 4,
#> 5,
#> 6
#>   ]
#> ]
chunked_array(1:2, 3:4, 5:6)$cast(date32())
#> ChunkedArray
#> [
#>   [
#> 1970-01-02,
#> 1970-01-03
#>   ],
#>   [
#> 1970-01-04,
#> 1970-01-05
#>   ],
#>   [
#> 1970-01-06,
#> 1970-01-07
#>   ]
#> ]
{code}





[jira] [Updated] (ARROW-15734) [R][Docs] Enable searching R docs

2022-03-02 Thread SHIMA Tatsuya (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHIMA Tatsuya updated ARROW-15734:
--
Component/s: R

> [R][Docs] Enable searching R docs
> -
>
> Key: ARROW-15734
> URL: https://issues.apache.org/jira/browse/ARROW-15734
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: SHIMA Tatsuya
>Priority: Major
>  Labels: pull-request-available
> Attachments: bs5.png, fixed-bs5.png, 
> image-2022-03-01-00-33-12-050.png, image-2022-03-01-00-46-51-350.png, 
> updated-list.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Enable Bootstrap 5 on the pkgdown website to use the built-in search feature.
> Are there any plans to switch to Bootstrap 5?
> https://pkgdown.r-lib.org/articles/search.html
> https://pkgdown.r-lib.org/articles/customise.html
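For context, enabling Bootstrap 5 in pkgdown is a one-line template setting in the package's `_pkgdown.yml`; per the pkgdown articles linked above, the built-in search UI comes with it. A minimal sketch:

```yaml
template:
  bootstrap: 5
```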



