[jira] [Created] (ARROW-18219) [R] read_csv_arrow fails when a string contains a backslash-escaped quote mark followed by a comma

2022-11-01 Thread Danielle Navarro (Jira)
Danielle Navarro created ARROW-18219:


 Summary: [R] read_csv_arrow fails when a string contains a 
backslash-escaped quote mark followed by a comma
 Key: ARROW-18219
 URL: https://issues.apache.org/jira/browse/ARROW-18219
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 10.0.0
Reporter: Danielle Navarro


`read_csv_arrow()` incorrectly parses CSV files when a string value contains a 
comma that appears after a backslash-escaped quote mark. Originally noted by 
Thomas Klebel: https://scicomm.xyz/@tklebel/109270436511066953

Here is a minimal example that triggers the error:

``` r
x <- tempfile()
readr::write_lines(
  '
id,text
1,"some text on \\"BLAH\\" and X, and Y also"
', x)

cat(system(paste('cat', x), intern = TRUE), sep = "\n")
#> 
#> id,text
#> 1,"some text on \"BLAH\" and X, and Y also"
arrow::read_csv_arrow(x, escape_backslash = TRUE)
#> Error:
#> ! Invalid: CSV parse error: Expected 2 columns, got 3: 1,"some text on 
\"BLAH\" and X, and Y also"

#> Backtrace:
#> ▆
#>  1. └─arrow (local) ``(file = x, escape_backslash = TRUE, delim = ",")
#>  2.   └─base::tryCatch(...) at r/R/csv.R:217:2
#>  3. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  4.   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  5. └─value[[3L]](cond)
#>  6.   └─arrow:::augment_io_error_msg(e, call, schema = schema) at 
r/R/csv.R:222:6
#>  7. └─rlang::abort(msg, call = call) at r/R/util.R:251:2
```

Created on 2022-11-02 with [reprex 
v2.0.2](https://reprex.tidyverse.org)

This version includes four rows that could plausibly trigger the error but do not:

``` r
x <- tempfile()
readr::write_lines(
  '
id,text
2,"some text on X and Y"
3,"some text on X, and Y"
4,"some text on \\"BLAH\\"
5,"some text on X and Y, and \\"BLAH\\" also"
', x)

cat(system(paste('cat', x), intern = TRUE), sep = "\n")
#> 
#> id,text
#> 2,"some text on X and Y"
#> 3,"some text on X, and Y"
#> 4,"some text on \"BLAH\"
#> 5,"some text on X and Y, and \"BLAH\" also"
arrow::read_csv_arrow(x, escape_backslash = TRUE)
#> # A tibble: 4 × 2
#>      id text                                         
#>   <int> <chr>                                        
#> 1     2 "some text on X and Y"   
#> 2     3 "some text on X, and Y"  
#> 3     4 "some text on \\BLAH\\\""
#> 4     5 "some text on X and Y, and \\BLAH\\\" also\""
```

Created on 2022-11-02 with [reprex 
v2.0.2](https://reprex.tidyverse.org)

I'm not sure whether the problem is R-specific. I've partially reproduced the 
error using reticulate and pyarrow as follows, but note that it errors at a 
different point: the pyarrow version appears to fail at the comma preceding 
the backslash-escaped quote mark:

``` r
x <- tempfile()
readr::write_lines(
  '
id,text
1,"some text on X and Y"
2,"some text on X, and Y"
3,"some text on \\"BLAH\\"
4,"some text on X and Y, and \\"BLAH\\" also"
5,"some text on \\"BLAH\\" and X, and Y also"
', x)

cat(system(paste('cat', x), intern = TRUE), sep = "\n")
#> 
#> id,text
#> 1,"some text on X and Y"
#> 2,"some text on X, and Y"
#> 3,"some text on \"BLAH\"
#> 4,"some text on X and Y, and \"BLAH\" also"
#> 5,"some text on \"BLAH\" and X, and Y also"

csv <- reticulate::import("pyarrow.csv")
opt <- csv$ParseOptions(escape_char='\\')
csv$read_csv(x, parse_options = opt)
#> Error in py_call_impl(callable, dots$args, dots$keywords): 
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 2 columns, got 3: 3,"some 
text on \"BLAH\"
#> 4,"some text on X and Y, and \"BLAH\" also"
```

Created on 2022-11-02 with [reprex 
v2.0.2](https://reprex.tidyverse.org)
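
For completeness, the same parse options can be exercised from C++ directly. 
The sketch below is untested and written from memory against the arrow::csv 
API (the `escaping` and `escape_char` options are the ones I believe the C++ 
reader exposes), but something along these lines should reproduce the issue 
without going through R or reticulate:

``` cpp
// Minimal repro sketch for the backslash-escape parse error (untested).
#include <iostream>
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>

int main() {
  // Same payload as the R reprex: an escaped quote followed by a comma.
  const std::string data =
      "id,text\n"
      "1,\"some text on \\\"BLAH\\\" and X, and Y also\"\n";

  auto input = std::make_shared<arrow::io::BufferReader>(
      arrow::Buffer::FromString(data));

  auto parse_options = arrow::csv::ParseOptions::Defaults();
  parse_options.escaping = true;  // escape_char defaults to '\\'

  auto maybe_reader = arrow::csv::TableReader::Make(
      arrow::io::default_io_context(), input,
      arrow::csv::ReadOptions::Defaults(), parse_options,
      arrow::csv::ConvertOptions::Defaults());
  if (!maybe_reader.ok()) {
    std::cerr << maybe_reader.status().ToString() << std::endl;
    return 1;
  }

  auto maybe_table = (*maybe_reader)->Read();
  if (!maybe_table.ok()) {
    // Expected, if the bug lives in the C++ parser:
    // "Invalid: CSV parse error: Expected 2 columns, got 3: ..."
    std::cerr << maybe_table.status().ToString() << std::endl;
    return 1;
  }
  std::cout << (*maybe_table)->ToString() << std::endl;
  return 0;
}
```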






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18218) [R] read_fwf_arrow

2022-11-01 Thread Lucas Mation (Jira)
Lucas Mation created ARROW-18218:


 Summary: [R] read_fwf_arrow
 Key: ARROW-18218
 URL: https://issues.apache.org/jira/browse/ARROW-18218
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Lucas Mation


It would be great if `arrow` provided a function to read fixed-width file (FWF) 
formats.

I have asked for help with this in two SO posts 
([here|https://stackoverflow.com/questions/74280697/dplyr-way-to-break-variable-into-multiple-columns-acording-to-layout-dictionar/74281380?noredirect=1#comment131145927_74281380] 
and 
[here|https://stackoverflow.com/questions/74276222/r-arrow-how-to-read-fwf-format-in-r-using-arrow/74279929#74279929]), 
but have not had much success so far.
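
Presumably such a reader would live in the C++ layer, like the CSV reader; the 
core operation is just slicing each line at fixed byte offsets. A tiny 
illustrative sketch (the `FwfColumn` spec is hypothetical, not a proposed API):

``` cpp
// Illustrative only: the core slicing step of a fixed-width reader.
#include <string>
#include <string_view>
#include <vector>

// Hypothetical column spec: byte offset and width of each field.
struct FwfColumn {
  std::size_t start;
  std::size_t width;
};

std::vector<std::string> SplitFixedWidth(std::string_view line,
                                         const std::vector<FwfColumn>& spec) {
  std::vector<std::string> fields;
  fields.reserve(spec.size());
  for (const auto& col : spec) {
    if (col.start >= line.size()) {
      fields.emplace_back();  // line shorter than the layout: empty field
    } else {
      fields.emplace_back(line.substr(col.start, col.width));
    }
  }
  return fields;
}

// SplitFixedWidth("AB123  xyz", {{0, 2}, {2, 3}, {5, 5}})
//   -> {"AB", "123", "  xyz"}
```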



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18217) Multiple Filesystem subclasses are missing an override for Equals

2022-11-01 Thread Vyas Ramasubramani (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vyas Ramasubramani updated ARROW-18217:
---
Description: 
Currently the `FileSystem` class contains two overloads of the `Equals` method:

{{virtual bool Equals(const FileSystem& other) const = 0;}}
{{virtual bool Equals(const std::shared_ptr<FileSystem>& other) const}}

{{{ return Equals(*other); }}}

The second is a trivial call to the first for ease of use. The first method is 
pure virtual and _must_ be overridden by subclasses. The problem is that 
overriding a single overload of a method also shadows all other overloads. As a 
result, it is no longer possible to call the `shared_ptr` version of the 
method. This appears to be the case for the `SubTreeFileSystem` and the 
`SlowFileSystem` in `filesystem.h` as well as the `S3FileSystem` in `s3fs.h`. 
There may be other classes with this problem as well, those are just the ones 
that I noticed. My guess is that what was intended here is to pull the method 
into the child class's namespace via a using declaration i.e. add `using 
FileSystem::Equals` to each child class.

  was:
Currently the `FileSystem` class contains two overloads of the `Equals` method:

```
virtual bool Equals(const FileSystem& other) const = 0;
virtual bool Equals(const std::shared_ptr<FileSystem>& other) const {
return Equals(*other);
}
```

The second is a trivial call to the first for ease of use. The first method is 
pure virtual and _must_ be overridden by subclasses. The problem is that 
overriding a single overload of a method also shadows all other overloads. As a 
result, it is no longer possible to call the `shared_ptr` version of the 
method. This appears to be the case for the `SubTreeFileSystem` and the 
`SlowFileSystem` in `filesystem.h` as well as the `S3FileSystem` in `s3fs.h`. 
There may be other classes with this problem as well, those are just the ones 
that I noticed. My guess is that what was intended here is to pull the method 
into the child class's namespace via a using declaration i.e. add `using 
FileSystem::Equals` to each child class.


> Multiple Filesystem subclasses are missing an override for Equals
> -
>
> Key: ARROW-18217
> URL: https://issues.apache.org/jira/browse/ARROW-18217
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 9.0.0, 10.0.0
>Reporter: Vyas Ramasubramani
>Priority: Minor
>
> Currently the `FileSystem` class contains two overloads of the `Equals` 
> method:
> {{virtual bool Equals(const FileSystem& other) const = 0;}}
> {{virtual bool Equals(const std::shared_ptr<FileSystem>& other) const}}
> {{{ return Equals(*other); }}}
> The second is a trivial call to the first for ease of use. The first method 
> is pure virtual and _must_ be overridden by subclasses. The problem is that 
> overriding a single overload of a method also shadows all other overloads. As 
> a result, it is no longer possible to call the `shared_ptr` version of the 
> method. This appears to be the case for the `SubTreeFileSystem` and the 
> `SlowFileSystem` in `filesystem.h` as well as the `S3FileSystem` in `s3fs.h`. 
> There may be other classes with this problem as well, those are just the ones 
> that I noticed. My guess is that what was intended here is to pull the method 
> into the child class's namespace via a using declaration i.e. add `using 
> FileSystem::Equals` to each child class.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18217) Multiple Filesystem subclasses are missing an override for Equals

2022-11-01 Thread Vyas Ramasubramani (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vyas Ramasubramani updated ARROW-18217:
---
Description: 
Currently the `FileSystem` class contains two overloads of the `Equals` method:

{{virtual bool Equals(const FileSystem& other) const = 0;}}
{{virtual bool Equals(const std::shared_ptr<FileSystem>& other) const}}

{ return Equals(*other); }

The second is a trivial call to the first for ease of use. The first method is 
pure virtual and _must_ be overridden by subclasses. The problem is that 
overriding a single overload of a method also shadows all other overloads. As a 
result, it is no longer possible to call the `shared_ptr` version of the 
method. This appears to be the case for the `SubTreeFileSystem` and the 
`SlowFileSystem` in `filesystem.h` as well as the `S3FileSystem` in `s3fs.h`. 
There may be other classes with this problem as well, those are just the ones 
that I noticed. My guess is that what was intended here is to pull the method 
into the child class's namespace via a using declaration i.e. add `using 
FileSystem::Equals` to each child class.

  was:
Currently the `FileSystem` class contains two overloads of the `Equals` method:

{{virtual bool Equals(const FileSystem& other) const = 0;}}
{{virtual bool Equals(const std::shared_ptr<FileSystem>& other) const}}

{{{ return Equals(*other); }}}

The second is a trivial call to the first for ease of use. The first method is 
pure virtual and _must_ be overridden by subclasses. The problem is that 
overriding a single overload of a method also shadows all other overloads. As a 
result, it is no longer possible to call the `shared_ptr` version of the 
method. This appears to be the case for the `SubTreeFileSystem` and the 
`SlowFileSystem` in `filesystem.h` as well as the `S3FileSystem` in `s3fs.h`. 
There may be other classes with this problem as well, those are just the ones 
that I noticed. My guess is that what was intended here is to pull the method 
into the child class's namespace via a using declaration i.e. add `using 
FileSystem::Equals` to each child class.


> Multiple Filesystem subclasses are missing an override for Equals
> -
>
> Key: ARROW-18217
> URL: https://issues.apache.org/jira/browse/ARROW-18217
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 9.0.0, 10.0.0
>Reporter: Vyas Ramasubramani
>Priority: Minor
>
> Currently the `FileSystem` class contains two overloads of the `Equals` 
> method:
> {{virtual bool Equals(const FileSystem& other) const = 0;}}
> {{virtual bool Equals(const std::shared_ptr<FileSystem>& other) const}}
> { return Equals(*other); }
> The second is a trivial call to the first for ease of use. The first method 
> is pure virtual and _must_ be overridden by subclasses. The problem is that 
> overriding a single overload of a method also shadows all other overloads. As 
> a result, it is no longer possible to call the `shared_ptr` version of the 
> method. This appears to be the case for the `SubTreeFileSystem` and the 
> `SlowFileSystem` in `filesystem.h` as well as the `S3FileSystem` in `s3fs.h`. 
> There may be other classes with this problem as well, those are just the ones 
> that I noticed. My guess is that what was intended here is to pull the method 
> into the child class's namespace via a using declaration i.e. add `using 
> FileSystem::Equals` to each child class.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18217) Multiple Filesystem subclasses are missing an override for Equals

2022-11-01 Thread Vyas Ramasubramani (Jira)
Vyas Ramasubramani created ARROW-18217:
--

 Summary: Multiple Filesystem subclasses are missing an override 
for Equals
 Key: ARROW-18217
 URL: https://issues.apache.org/jira/browse/ARROW-18217
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 10.0.0, 9.0.0
Reporter: Vyas Ramasubramani


Currently the `FileSystem` class contains two overloads of the `Equals` method:

```
virtual bool Equals(const FileSystem& other) const = 0;
virtual bool Equals(const std::shared_ptr<FileSystem>& other) const {
return Equals(*other);
}
```

The second is a trivial call to the first for ease of use. The first method is 
pure virtual and _must_ be overridden by subclasses. The problem is that 
overriding a single overload of a method also shadows all other overloads. As a 
result, it is no longer possible to call the `shared_ptr` version of the 
method. This appears to be the case for the `SubTreeFileSystem` and the 
`SlowFileSystem` in `filesystem.h` as well as the `S3FileSystem` in `s3fs.h`. 
There may be other classes with this problem as well, those are just the ones 
that I noticed. My guess is that what was intended here is to pull the method 
into the child class's namespace via a using declaration i.e. add `using 
FileSystem::Equals` to each child class.
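
The following standalone sketch (not Arrow code) demonstrates the name-hiding 
behavior and the proposed using-declaration fix:

```
// Standalone demonstration of C++ overload name hiding (not Arrow code).
#include <memory>

struct Base {
  virtual ~Base() = default;
  virtual bool Equals(const Base& other) const = 0;
  virtual bool Equals(const std::shared_ptr<Base>& other) const {
    return Equals(*other);
  }
};

struct Hidden : Base {
  // Overriding one overload hides every Base::Equals overload.
  bool Equals(const Base&) const override { return true; }
};

struct Fixed : Base {
  using Base::Equals;  // re-expose the shared_ptr overload
  bool Equals(const Base&) const override { return true; }
};

int main() {
  std::shared_ptr<Base> other = std::make_shared<Fixed>();
  Fixed f;
  bool ok = f.Equals(other);  // compiles thanks to `using Base::Equals;`
  // Hidden h;
  // h.Equals(other);  // error: no viable overload; shared_ptr version hidden
  return ok ? 0 : 1;
}
```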



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18207) Rubygems not updating in concert with majors

2022-11-01 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627379#comment-17627379
 ] 

Kouhei Sutou commented on ARROW-18207:
--

I think that we can't decide it without data.

BTW, the problem can be fixed by improving the extpp gem. So I've improved and 
released a new extpp gem.

> Rubygems not updating in concert with majors
> 
>
> Key: ARROW-18207
> URL: https://issues.apache.org/jira/browse/ARROW-18207
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 10.0.0
>Reporter: Noah Horton
>Assignee: Kouhei Sutou
>Priority: Major
>
> 10.0.0 just released, meaning that all install scripts that use the 
> 'latest' tag are getting it.
> Yet rubygems.org is still serving the 9.0.0 version a week after 10.0.0 
> was released.
> The build scripts need to start updating rubygems.org automatically, or guide 
> users to a bundler config like 
> {code:ruby}
> gem "red-arrow", github: "apache/arrow", glob: "ruby/red-arrow/*.gemspec", 
> require: "arrow", tag: 'apache-arrow-10.0.0'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18215) [R] User experience improvements

2022-11-01 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-18215:
-
Description: Umbrella ticket to collect together tickets relating to 
improving error messages, and general user experience tweaks  (was: Umbrella 
ticket to collect together tickets relating to improving error messages, and 
general dev-experience tweaks)

> [R] User experience improvements
> 
>
> Key: ARROW-18215
> URL: https://issues.apache.org/jira/browse/ARROW-18215
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>
> Umbrella ticket to collect together tickets relating to improving error 
> messages, and general user experience tweaks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18199) [R] Misleading error message in query using across()

2022-11-01 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane reassigned ARROW-18199:


Assignee: (was: Nicola Crane)

> [R] Misleading error message in query using across()
> 
>
> Key: ARROW-18199
> URL: https://issues.apache.org/jira/browse/ARROW-18199
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Priority: Critical
>
> Error handling looks like it's happening in the wrong place - a comma has 
> been missed in the {{select()}}, but it wrongly appears to be an issue 
> with {{across()}}.  Can we do something to make this not happen?
> {code:r}
> download.file(
>   url = 
> "https://github.com/djnavarro/arrow-user2022/releases/download/v0.1/nyc-taxi-tiny.zip",
>   destfile = here::here("data/nyc-taxi-tiny.zip")
> )
> library(arrow)
> library(dplyr)
> open_dataset("data") %>%
>   select(pickup_datetime, pickup_longitude, pickup_latitude 
> ends_with("amount")) %>%
>   mutate(across(ends_with("amount"), ~.x * 0.87, .names = "{.col}_gbp")) %>%
>   collect()
> {code}
> {code:r}
> Error in `across()`:
> ! Must be used inside dplyr verbs.
> Run `rlang::last_error()` to see where the error occurred.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18200) [R] Misleading error message if opening CSV dataset with invalid file in directory

2022-11-01 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane reassigned ARROW-18200:


Assignee: (was: Nicola Crane)

> [R] Misleading error message if opening CSV dataset with invalid file in 
> directory
> --
>
> Key: ARROW-18200
> URL: https://issues.apache.org/jira/browse/ARROW-18200
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>
> I made a mistake before where I thought a dataset contained CSVs which were, 
> in fact, Parquet files, but the error message I got was super unhelpful
> {code:r}
> library(arrow)
> download.file(
>   url = 
> "https://github.com/djnavarro/arrow-user2022/releases/download/v0.1/nyc-taxi-tiny.zip",
>   destfile = here::here("data/nyc-taxi-tiny.zip")
> )
>  # (unzip the zip file into the data directory but don't delete it after)
> open_dataset("data", format = "csv")
> {code}
> {code:r}
> Error in nchar(x) : invalid multibyte string, element 1
> In addition: Warning message:
> In grepl("No match for FieldRef.Name(__filename)", msg, fixed = TRUE) :
>   input string 1 is invalid in this locale
> {code}
> Note, this only occurs with {{format="csv"}}; omitting this argument (i.e. 
> the default of {{format="parquet"}}) leaves us with the much better error:
> {code:r}
> Error in `open_dataset()`:
> ! Invalid: Error creating dataset. Could not read schema from 
> '/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Could not open Parquet 
> input source '/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Parquet 
> magic bytes not found in footer. Either the file is corrupted or this is not 
> a parquet file.
> /home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:338  GetReader(source, 
> scan_options). Is this a 'parquet' file?
> /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:44  
> InspectSchemas(std::move(options))
> /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:265  
> Inspect(options.inspect_options)
> ℹ Did you mean to specify a 'format' other than the default (parquet)?
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17607) [R] Add as_scalar()

2022-11-01 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17607:
-
Parent: ARROW-18215
Issue Type: Sub-task  (was: New Feature)

> [R] Add as_scalar()
> ---
>
> Key: ARROW-17607
> URL: https://issues.apache.org/jira/browse/ARROW-17607
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>
> There's as_everything_else but not as_scalar. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18216) [R] Better error message when creating an array from decimals

2022-11-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18216:


 Summary: [R] Better error message when creating an array from 
decimals
 Key: ARROW-18216
 URL: https://issues.apache.org/jira/browse/ARROW-18216
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane


We should first check why this doesn't work, and whether we can fix the 
underlying problem rather than just the error message.

{code:r}
> ChunkedArray$create(c(1.4, 525.5), type = decimal(precision = 1, scale = 3))
Error: NotImplemented: Extend
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18200) [R] Misleading error message if opening CSV dataset with invalid file in directory

2022-11-01 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-18200:
-
Parent: ARROW-18215
Issue Type: Sub-task  (was: Bug)

> [R] Misleading error message if opening CSV dataset with invalid file in 
> directory
> --
>
> Key: ARROW-18200
> URL: https://issues.apache.org/jira/browse/ARROW-18200
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
>
> I made a mistake before where I thought a dataset contained CSVs which were, 
> in fact, Parquet files, but the error message I got was super unhelpful
> {code:r}
> library(arrow)
> download.file(
>   url = 
> "https://github.com/djnavarro/arrow-user2022/releases/download/v0.1/nyc-taxi-tiny.zip",
>   destfile = here::here("data/nyc-taxi-tiny.zip")
> )
>  # (unzip the zip file into the data directory but don't delete it after)
> open_dataset("data", format = "csv")
> {code}
> {code:r}
> Error in nchar(x) : invalid multibyte string, element 1
> In addition: Warning message:
> In grepl("No match for FieldRef.Name(__filename)", msg, fixed = TRUE) :
>   input string 1 is invalid in this locale
> {code}
> Note, this only occurs with {{format="csv"}}; omitting this argument (i.e. 
> the default of {{format="parquet"}}) leaves us with the much better error:
> {code:r}
> Error in `open_dataset()`:
> ! Invalid: Error creating dataset. Could not read schema from 
> '/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Could not open Parquet 
> input source '/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Parquet 
> magic bytes not found in footer. Either the file is corrupted or this is not 
> a parquet file.
> /home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:338  GetReader(source, 
> scan_options). Is this a 'parquet' file?
> /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:44  
> InspectSchemas(std::move(options))
> /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:265  
> Inspect(options.inspect_options)
> ℹ Did you mean to specify a 'format' other than the default (parquet)?
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18176) [R] arrow::open_dataset %>% select(myvars) %>% collect causes memory leak

2022-11-01 Thread Lucas Mation (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Mation updated ARROW-18176:
-
Description: 
I first posted on StackOverlow, 
[here.|https://stackoverflow.com/questions/74221492/r-arrow-open-dataset-selectmyvars-collect-causing-memory-leak]

I am having trouble using arrow in R. First, I saved some {{data.tables}} that 
were about 50-60Gb ({{{}d{}}} in the code chunk) in memory to a parquet file 
using:
 
{{d %>% write_dataset(f, format='parquet') # f is the directory name}}

Then I try to read open the file, select the relevant variables and
 
{{tic()d2 <- open_dataset(f) %>% select(all_of(myvars)) %>% collect #myvars is 
a vector of variable namestoc()}}

I did this conversion for 3 sets of data.tables (unfortunately, data is 
confidential so I can't include in the example). In one set, I was able to 
{{open>select>collect}} the desired table in about 60s, obtaining a 10Gb file 
(after variable selection).

For the other two sets, the command caused a memory leak. tic()-toc() returned 
after 80s. But the object name (d2) never appeared in Rstudio's "Enviroment 
panel", and memory used keeps creeping up until it occupied most of the 
available RAM of the server, and then R crashed. Note the orginal dataset, 
without subsetting cols, was smaller than 60Gb and the server had 512GB.

Any ideas on what could be going on here?

UPDATE: today I noticed a few more things.

1) If the collected object is small enough (3 cols, 66million rows), R will 
unfreeze. The console becomes responsive, the object shows up in the 
Environment panel. But memory use keeps going up (by small amounts because the 
underlying that is small). While this is helpening, issuing a gc() command 
reduces the memory use, but it then starts growing again.

2) Even after "rm(d2)" and  "gc()", the R session that issued the arrow 
commands still use around 60-70Gb of RAM... The only way to end that is to 
close the R session. 

3) I am using arrow 10.0.0

 

 

 

  was:
I first posted on Stack Overflow, 
[here|https://stackoverflow.com/questions/74221492/r-arrow-open-dataset-selectmyvars-collect-causing-memory-leak].

I am having trouble using arrow in R. First, I saved some {{data.tables}} that 
were about 50-60Gb ({{d}} in the code chunk) in memory to a parquet file 
using:

{{d %>% write_dataset(f, format='parquet') # f is the directory name}}

Then I try to open the file, select the relevant variables, and collect:

{{tic()}}
{{d2 <- open_dataset(f) %>% select(all_of(myvars)) %>% collect # myvars is a 
vector of variable names}}
{{toc()}}

I did this conversion for 3 sets of data.tables (unfortunately, the data is 
confidential so I can't include it in the example). In one set, I was able to 
{{open>select>collect}} the desired table in about 60s, obtaining a 10Gb file 
(after variable selection).

For the other two sets, the command caused a memory leak. tic()-toc() returned 
after 80s, but the object name (d2) never appeared in RStudio's "Environment" 
panel, and memory use kept creeping up until it occupied most of the available 
RAM of the server, at which point R crashed. Note the original dataset, 
without subsetting cols, was smaller than 60Gb and the server had 512GB.

Any ideas on what could be going on here?

UPDATE: today I noticed a few more things.

1) If the collected object is small enough (3 cols, 66 million rows), R will 
unfreeze. The console becomes responsive, and the object shows up in the 
Environment panel. But memory use keeps going up (by small amounts, because 
the underlying data is small). While this is happening, issuing a gc() command 
reduces the memory use, but it then starts growing again.

2) Even after "rm(d2)" and "gc()", the R session that issued the arrow 
commands still uses around 60-70Gb of RAM. The only way to free it is to 
close the R session.


> [R] arrow::open_dataset %>% select(myvars) %>% collect causes memory leak
> -
>
> Key: ARROW-18176
> URL: https://issues.apache.org/jira/browse/ARROW-18176
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Lucas Mation
>Priority: Critical
>
> I first posted on Stack Overflow, 
> [here|https://stackoverflow.com/questions/74221492/r-arrow-open-dataset-selectmyvars-collect-causing-memory-leak].
> I am having trouble using arrow in R. First, I saved some {{data.tables}} 
> that were about 50-60Gb ({{d}} in the code chunk) in memory to a parquet 
> file using:
> {{d %>% write_dataset(f, format='parquet') # f is the directory name}}
> Then I try to open the file, select the relevant variables, and collect:
> {{tic()}}
> {{d2 <- open_dataset(f) %>% select(all_of(myvars)) %>% collect # myvars is a 
> vector of variable names}}
> {{toc()}}
> I did this conversion for 3 sets of data.tables (unfortunately, data is 
> 

[jira] [Created] (ARROW-18215) [R] User experience improvements

2022-11-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18215:


 Summary: [R] User experience improvements
 Key: ARROW-18215
 URL: https://issues.apache.org/jira/browse/ARROW-18215
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


Umbrella ticket to collect together tickets relating to improving error 
messages, and general dev-experience tweaks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18199) [R] Misleading error message in query using across()

2022-11-01 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-18199:
-
Parent: ARROW-18215
Issue Type: Sub-task  (was: Bug)

> [R] Misleading error message in query using across()
> 
>
> Key: ARROW-18199
> URL: https://issues.apache.org/jira/browse/ARROW-18199
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Critical
>
> Error handling looks like it's happening in the wrong place - a comma has 
> been missed in the {{select()}}, but it wrongly appears to be an issue 
> with {{across()}}.  Can we do something to make this not happen?
> {code:r}
> download.file(
>   url = 
> "https://github.com/djnavarro/arrow-user2022/releases/download/v0.1/nyc-taxi-tiny.zip",
>   destfile = here::here("data/nyc-taxi-tiny.zip")
> )
> library(arrow)
> library(dplyr)
> open_dataset("data") %>%
>   select(pickup_datetime, pickup_longitude, pickup_latitude 
> ends_with("amount")) %>%
>   mutate(across(ends_with("amount"), ~.x * 0.87, .names = "{.col}_gbp")) %>%
>   collect()
> {code}
> {code:r}
> Error in `across()`:
> ! Must be used inside dplyr verbs.
> Run `rlang::last_error()` to see where the error occurred.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18051) [C++] Enable tests skipped by ARROW-16392

2022-11-01 Thread Vibhatha Lakmal Abeykoon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vibhatha Lakmal Abeykoon reassigned ARROW-18051:


Assignee: Vibhatha Lakmal Abeykoon

> [C++] Enable tests skipped by ARROW-16392
> -
>
> Key: ARROW-18051
> URL: https://issues.apache.org/jira/browse/ARROW-18051
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> There are a number of unit tests that we still skip (on Windows) due to 
> ARROW-16392.  However, ARROW-16392 has been fixed.  There is no reason to 
> skip these any longer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18183) [C++] cpp-micro benchmarks are failing on mac arm machine

2022-11-01 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace reassigned ARROW-18183:
---

Assignee: Weston Pace

> [C++] cpp-micro benchmarks are failing on mac arm machine
> -
>
> Key: ARROW-18183
> URL: https://issues.apache.org/jira/browse/ARROW-18183
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Benchmarking, C++
>Reporter: Elena Henderson
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18183) [C++]cpp-micro benchmarks are failing on mac arm machine

2022-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18183:
---
Labels: pull-request-available  (was: )

> [C++]cpp-micro benchmarks are failing on mac arm machine
> 
>
> Key: ARROW-18183
> URL: https://issues.apache.org/jira/browse/ARROW-18183
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Benchmarking, C++
>Reporter: Elena Henderson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18183) [C++] cpp-micro benchmarks are failing on mac arm machine

2022-11-01 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-18183:

Summary: [C++] cpp-micro benchmarks are failing on mac arm machine  (was: 
[C++]cpp-micro benchmarks are failing on mac arm machine)

> [C++] cpp-micro benchmarks are failing on mac arm machine
> -
>
> Key: ARROW-18183
> URL: https://issues.apache.org/jira/browse/ARROW-18183
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Benchmarking, C++
>Reporter: Elena Henderson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18183) [C++]cpp-micro benchmarks are failing on mac arm machine

2022-11-01 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-18183:

Summary: [C++]cpp-micro benchmarks are failing on mac arm machine  (was: 
cpp-micro benchmarks are failing on mac arm machine)

> [C++]cpp-micro benchmarks are failing on mac arm machine
> 
>
> Key: ARROW-18183
> URL: https://issues.apache.org/jira/browse/ARROW-18183
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Benchmarking, C++
>Reporter: Elena Henderson
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18148) [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather

2022-11-01 Thread Danielle Navarro (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627350#comment-17627350
 ] 

Danielle Navarro commented on ARROW-18148:
--

Agreed. For the current PR I'll write it as though the API weren't changing, 
but will still prefer the term "Arrow" over "Feather" where that's relevant.

> [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather
> --
>
> Key: ARROW-18148
> URL: https://issues.apache.org/jira/browse/ARROW-18148
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Reporter: Stephanie Hazlitt
>Priority: Minor
>  Labels: feather
>
> Following up from [this mailing list 
> conversation|https://lists.apache.org/thread/nxncph842h8tyovxp04hrzq4y35lq4xq],
>  I am wondering if the R package should rename `read_ipc_file()` / 
> `write_ipc_file()` to `read_arrow_file()` / `write_arrow_file()`, or add an 
> additional alias for both. It might also be helpful to update the 
> documentation so that users read "Write an Arrow file (formerly known as a 
> Feather file)" rather than the current Feather-named first approach, assuming 
> there is a community decision to coalesce around the name Arrow for the file 
> format, and the project is moving on from the name Feather.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18114) [R] unify_schemas=FALSE does not improve open_dataset() read times

2022-11-01 Thread Carl Boettiger (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627339#comment-17627339
 ] 

Carl Boettiger commented on ARROW-18114:


Thanks Weston! Any update here?

> [R] unify_schemas=FALSE does not improve open_dataset() read times
> --
>
> Key: ARROW-18114
> URL: https://issues.apache.org/jira/browse/ARROW-18114
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Carl Boettiger
>Priority: Major
>
> open_dataset() provides the very helpful optional argument to set 
> unify_schemas=FALSE, which should allow arrow to inspect a single parquet 
> file instead of touching potentially thousands or more parquet files to 
> determine a consistent unified schema.  This ought to provide a substantial 
> performance increase in contexts where the schema is known in advance.
> Unfortunately, in my tests it seems to have no impact on performance.  
> Consider the following reprexes:
>  default, unify_schemas=TRUE 
> {code:java}
> library(arrow)
> ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min", 
> endpoint_override = "data.ecoforecast.org", anonymous=TRUE)
> bench::bench_time(
> { open_dataset(ex) }
> ){code}
> about 32 seconds for me.
>  manual, unify_schemas=FALSE:  
> {code:java}
> bench::bench_time({
> open_dataset(ex, unify_schemas = FALSE)
> }){code}
> takes about 32 seconds as well. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18214) [R] Use ISO 8601 in character representations of datetimes?

2022-11-01 Thread Carl Boettiger (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Boettiger updated ARROW-18214:
---
Description: 
Arrow needs to represent datetime / timestamp values as character strings, e.g. 
when writing to CSV or when generating partitions on timestamp-valued column. 
When this occurs, Arrow generates a string such as:
"2022-11-01 21:12:46.771925+"
In particular, this uses a space instead of a T between the date and time 
components.  I believe either is permitted in [RFC 
3339|https://www.rfc-editor.org/rfc/rfc3339.html#section-5] 

??5.6. NOTE: ISO 8601 defines date and time separated by "T". Applications 
using this syntax may choose, for the sake of readability, to specify a 
full-date and full-time separated by (say) a space character.??

 

But as RFC 3339 notes, this is not valid under ISO 8601.  It would be 
preferable to stick to the stricter ISO 8601 convention.

  was:
Arrow needs to represent datetime / timestamp values as character strings, e.g. 
when writing to CSV or when generating partitions on timestamp-valued column. 
When this occurs, Arrow generates a string such as:
"2022-11-01 21:12:46.771925+"
In particular, this uses a space instead of a T between the date and time 
components.  I believe either is permitted in [RFC 
3339|https://www.rfc-editor.org/rfc/rfc3339.html#section-5] 

??5.6. NOTE: ISO 8601 defines date and time separated by "T". Applications 
using this syntax may choose, for the sake of readability, to specify a 
full-date and full-time separated by (say) a space character.??

 

But as RFC 3339 notes, this is not valid under ISO 8601.  It would be 
preferable to stick to the stricter ISO 8601 convention. This would be more 
consistent with other software.


> [R] Use ISO 8601 in character representations of datetimes?
> ---
>
> Key: ARROW-18214
> URL: https://issues.apache.org/jira/browse/ARROW-18214
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Carl Boettiger
>Priority: Major
>
> Arrow needs to represent datetime / timestamp values as character strings, 
> e.g. when writing to CSV or when generating partitions on timestamp-valued 
> column. When this occurs, Arrow generates a string such as:
> "2022-11-01 21:12:46.771925+"
> In particular, this uses a space instead of a T between the date and time 
> components.  I believe either is permitted in [RFC 
> 3339|https://www.rfc-editor.org/rfc/rfc3339.html#section-5] 
> ??5.6. NOTE: ISO 8601 defines date and time separated by "T". Applications 
> using this syntax may choose, for the sake of readability, to specify a 
> full-date and full-time separated by (say) a space character.??
>  
> But as RFC 3339 notes, this is not valid under ISO 8601.  It would be 
> preferable to stick to the stricter ISO 8601 convention.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18214) [R] Use ISO 8601 in character representations of datetimes?

2022-11-01 Thread Carl Boettiger (Jira)
Carl Boettiger created ARROW-18214:
--

 Summary: [R] Use ISO 8601 in character representations of 
datetimes?
 Key: ARROW-18214
 URL: https://issues.apache.org/jira/browse/ARROW-18214
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Carl Boettiger


Arrow needs to represent datetime / timestamp values as character strings, e.g. 
when writing to CSV or when generating partitions on timestamp-valued column. 
When this occurs, Arrow generates a string such as:
"2022-11-01 21:12:46.771925+"
In particular, this uses a space instead of a T between the date and time 
components.  I believe either is permitted in [RFC 
3339|https://www.rfc-editor.org/rfc/rfc3339.html#section-5] 

??5.6. NOTE: ISO 8601 defines date and time separated by "T". Applications 
using this syntax may choose, for the sake of readability, to specify a 
full-date and full-time separated by (say) a space character.??

 

But as RFC 3339 notes, this is not valid under ISO 8601.  It would be 
preferable to stick to the stricter ISO 8601 convention. This would be more 
consistent with other software.
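
To make the difference concrete, here is a standalone sketch (plain C++, not 
Arrow code) printing the two separator conventions side by side:

{code:cpp}
// Standalone illustration (not Arrow code): RFC 3339's permitted space
// separator versus ISO 8601's required "T" separator.
#include <cstdio>
#include <ctime>

int main() {
  std::time_t now = std::time(nullptr);
  std::tm utc = *std::gmtime(&now);
  char with_space[32];
  char with_t[32];
  // RFC 3339 permits a space between full-date and full-time...
  std::strftime(with_space, sizeof(with_space), "%Y-%m-%d %H:%M:%S", &utc);
  // ...but strict ISO 8601 requires the "T".
  std::strftime(with_t, sizeof(with_t), "%Y-%m-%dT%H:%M:%S", &utc);
  std::printf("RFC 3339 style: %s\nISO 8601 style: %s\n", with_space, with_t);
  return 0;
}
{code}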



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17640) [C++] Add File Handling Test cases for GlobFile handling in Substrait Read

2022-11-01 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace resolved ARROW-17640.
-
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14132
[https://github.com/apache/arrow/pull/14132]

> [C++] Add File Handling Test cases for GlobFile handling in Substrait Read
> --
>
> Key: ARROW-17640
> URL: https://issues.apache.org/jira/browse/ARROW-17640
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> At the moment the `GlobFiles` function hasn't been tested with an end-to-end 
> Substrait-To-Arrow case. It was also observed that the leading slash is 
> ignored in this API. The proposed change adds test cases to cover the fix, 
> including end-to-end test cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND

2022-11-01 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627330#comment-17627330
 ] 

Kouhei Sutou commented on ARROW-17374:
--

Could you provide command lines to reproduce this with conda?

> [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
> --
>
> Key: ARROW-17374
> URL: https://issues.apache.org/jira/browse/ARROW-17374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0, 8.0.1, 9.0.0
> Environment: Amazon Linux 2 (RHEL) - 5.10.102-99.473.amzn2.x86_64
>Reporter: Shane Brennan
>Priority: Blocker
> Attachments: build-images.out
>
>
> I've been trying to install Arrow on an R notebook within AWS SageMaker. 
> SageMaker provides Jupyter-like notebooks, with each instance running Amazon 
> Linux 2 as its OS, itself based on RHEL. 
> I have tried to install it a few ways (e.g. using the standard binaries, 
> using the nightly builds, setting ARROW_WITH_SNAPPY to ON and 
> LIBARROW_MINIMAL); all still result in the following error. 
> {noformat}
> x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared 
> -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -Wl,-O2 -Wl,--sort-common 
> -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags 
> -Wl,--gc-sections -Wl,--allow-shlib-undefined 
> -Wl,-rpath,/home/ec2-user/anaconda3/envs/R/lib 
> -Wl,-rpath-link,/home/ec2-user/anaconda3/envs/R/lib 
> -L/home/ec2-user/anaconda3/envs/R/lib -o arrow.so RTasks.o altrep.o array.o 
> array_to_vector.o arraydata.o arrowExports.o bridge.o buffer.o chunkedarray.o 
> compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o 
> expression.o extension-impl.o feather.o field.o filesystem.o imports.o io.o 
> json.o memorypool.o message.o parquet.o r_to_arrow.o recordbatch.o 
> recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o 
> schema.o symbols.o table.o threadpool.o type_infer.o 
> -L/tmp/Rtmpuh87oc/R.INSTALL67114493a3de/arrow/libarrow/arrow-9.0.0.20220809/lib
>  -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies -lz 
> SNAPPY_LIB-NOTFOUND /home/ec2-user/anaconda3/envs/R/lib/libbz2.so -pthread 
> -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto 
> -lcurl -lssl -lcrypto -lcurl -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -lR
> x86_64-conda-linux-gnu-c++: error: SNAPPY_LIB-NOTFOUND: No such file or 
> directory
> make: *** [/home/ec2-user/anaconda3/envs/R/lib/R/share/make/shlib.mk:10: 
> arrow.so] Error 1{noformat}
> Snappy is installed on the system, and both the shared object (.so) and 
> cmake files are there. I've tried setting the system env variables 
> Snappy_DIR and Snappy_LIB to point at them, but to no avail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18161) [Ruby] Tables can have buffers get GC'ed

2022-11-01 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-18161:
-
Summary: [Ruby] Tables can have buffers get GC'ed  (was: Ruby Arrow Tables 
can have buffers get GC'ed)

> [Ruby] Tables can have buffers get GC'ed
> 
>
> Key: ARROW-18161
> URL: https://issues.apache.org/jira/browse/ARROW-18161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 9.0.0
> Environment: Ruby 3.1.2
>Reporter: Noah Horton
>Assignee: Kouhei Sutou
>Priority: Major
>
> Given an Arrow::Table with several columns, "x":
>  
> {code:ruby}
> # Rails console outputs
> 3.1.2 :107 > x.schema
>  => 
> #<Arrow::Schema:0x… ptr=0x…
> dates: date32[day]
> expected_values: double>
> 3.1.2 :108 > x.schema
>  => 
> #<Arrow::Schema:0x… ptr=0x…
> dates: date32[day]
> expected_values: double>
> 3.1.2 :109 > {code}
> Note that the object and pointer have both changed values.
> But the far bigger issue is that repeated reads from it will cause different 
> results:
> {code:ruby}
> 3.1.2 :097 > x[1][0]
>  => Sun, 22 Aug 2021 
> 3.1.2 :098 > x[1][1]
>  => nil 
> 3.1.2 :099 > x[1][0]
>  => nil {code}
> I have a lot of issues like this - when I have done these types of read 
> operations, I get the original table with the data in the columns all 
> shuffled around or deleted. 
> I do ingest the data slightly oddly in the first place, as it comes in over 
> gRPC: I am using Arrow::Buffer to read it from the gRPC stream and then 
> passing that into Arrow::Table.load. But I would not expect that, once the 
> data was in an Arrow::Table, I could do anything to permute it 
> unintentionally.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18186) [C++][MinGW] Fail to build with clang

2022-11-01 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-18186.
--
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14536
[https://github.com/apache/arrow/pull/14536]

> [C++][MinGW] Fail to build with clang
> -
>
> Key: ARROW-18186
> URL: https://issues.apache.org/jira/browse/ARROW-18186
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> https://github.com/kou/arrow/actions/runs/3342340048/jobs/5534465173#step:7:768
> {noformat}
> FAILED: src/arrow/CMakeFiles/arrow_shared.dir/util/int_util.cc.obj 
> D:\a\_temp\msys64\clang64\bin\ccache.exe 
> D:\a\_temp\msys64\clang64\bin\c++.exe -DARROW_EXPORTING 
> -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_BMI2 
> -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 -DARROW_WITH_BROTLI 
> -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_RE2 -DARROW_WITH_SNAPPY 
> -DARROW_WITH_UTF8PROC -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD 
> -DAWS_AUTH_USE_IMPORT_EXPORT -DAWS_CAL_USE_IMPORT_EXPORT 
> -DAWS_CHECKSUMS_USE_IMPORT_EXPORT -DAWS_COMMON_USE_IMPORT_EXPORT 
> -DAWS_COMPRESSION_USE_IMPORT_EXPORT -DAWS_CRT_CPP_USE_IMPORT_EXPORT 
> -DAWS_EVENT_STREAM_USE_IMPORT_EXPORT -DAWS_HTTP_USE_IMPORT_EXPORT 
> -DAWS_IO_USE_IMPORT_EXPORT -DAWS_MQTT_USE_IMPORT_EXPORT 
> -DAWS_MQTT_WITH_WEBSOCKETS -DAWS_S3_USE_IMPORT_EXPORT 
> -DAWS_SDKUTILS_USE_IMPORT_EXPORT -DAWS_SDK_VERSION_MAJOR=1 
> -DAWS_SDK_VERSION_MINOR=9 -DAWS_SDK_VERSION_PATCH=367 
> -DAWS_USE_IO_COMPLETION_PORTS -DBOOST_ALL_DYN_LINK -DBOOST_ALL_NO_LIB 
> -DURI_STATIC_BUILD -DUSE_IMPORT_EXPORT -DUSE_IMPORT_EXPORT=1 
> -DUSE_WINDOWS_DLL_SEMANTICS -D_CRT_SECURE_NO_WARNINGS 
> -D_ENABLE_EXTENDED_ALIGNED_STORAGE -Darrow_shared_EXPORTS 
> -ID:/a/arrow/arrow/build/cpp/src -ID:/a/arrow/arrow/cpp/src 
> -ID:/a/arrow/arrow/cpp/src/generated -isystem 
> D:/a/arrow/arrow/cpp/thirdparty/flatbuffers/include -isystem 
> D:/a/arrow/arrow/cpp/thirdparty/hadoop/include -isystem 
> D:/a/arrow/arrow/build/cpp/google_cloud_cpp_ep-install/include -isystem 
> D:/a/arrow/arrow/build/cpp/crc32c_ep-install/include -Qunused-arguments 
> -fcolor-diagnostics -O2 -DNDEBUG  -Wa,-mbig-obj -Wall -Wextra -Wdocumentation 
> -Wshorten-64-to-32 -Wno-missing-braces -Wno-unused-parameter 
> -Wno-constant-logical-operand -Wno-return-stack-address 
> -Wno-unknown-warning-option -Wno-pass-failed -mxsave -msse4.2   -DNDEBUG 
> -pthread -std=c++17 -MD -MT 
> src/arrow/CMakeFiles/arrow_shared.dir/util/int_util.cc.obj -MF 
> src\arrow\CMakeFiles\arrow_shared.dir\util\int_util.cc.obj.d -o 
> src/arrow/CMakeFiles/arrow_shared.dir/util/int_util.cc.obj -c 
> D:/a/arrow/arrow/cpp/src/arrow/util/int_util.cc
> D:/a/arrow/arrow/cpp/src/arrow/util/int_util.cc:463:1: error: an attribute 
> list cannot appear here
> INSTANTIATE_ALL()
> ^
> D:/a/arrow/arrow/cpp/src/arrow/util/int_util.cc:454:3: note: expanded from 
> macro 'INSTANTIATE_ALL'
>   INSTANTIATE_ALL_DEST(uint8_t)  \
>   ^
> D:/a/arrow/arrow/cpp/src/arrow/util/int_util.cc:444:3: note: expanded from 
> macro 'INSTANTIATE_ALL_DEST'
>   INSTANTIATE(uint8_t, DEST)   \
>   ^~
> D:/a/arrow/arrow/cpp/src/arrow/util/int_util.cc:440:12: note: expanded from 
> macro 'INSTANTIATE'
>   template ARROW_TEMPLATE_EXPORT void TransposeInts( \
>^
> D:/a/arrow/arrow/cpp/src/arrow/util/visibility.h:47:31: note: expanded from 
> macro 'ARROW_TEMPLATE_EXPORT'
> #define ARROW_TEMPLATE_EXPORT ARROW_DLLEXPORT
>   ^~~
> D:/a/arrow/arrow/cpp/src/arrow/util/visibility.h:32:25: note: expanded from 
> macro 'ARROW_DLLEXPORT'
> #define ARROW_DLLEXPORT [[gnu::dllexport]]
> ^~
> ...
> [127/801] Building CXX object 
> src/arrow/CMakeFiles/arrow_shared.dir/util/io_util.cc.obj
> D:/a/arrow/arrow/cpp/src/arrow/util/io_util.cc:1079:7: warning: variable 
> 'oflag' set but not used [-Wunused-but-set-variable]
>   int oflag = _O_CREAT | _O_BINARY | _O_NOINHERIT;
>   ^
> D:/a/arrow/arrow/cpp/src/arrow/util/io_util.cc:1545:29: warning: missing 
> field 'InternalHigh' initializer [-Wmissing-field-initializers]
>   OVERLAPPED overlapped = {0};
> ^
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17288) [C++] Create fragment scanners for csv/parquet/orc/ipc

2022-11-01 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-17288:
-

Assignee: (was: Weston Pace)

> [C++] Create fragment scanners for csv/parquet/orc/ipc
> --
>
> Key: ARROW-17288
> URL: https://issues.apache.org/jira/browse/ARROW-17288
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> Once we have the basic scan node ready (with an initial implementation based 
> on in-memory fragments), we can add the file-format versions.  We 
> may also want to consider adding JSON support for datasets at this time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18213) [R] Arrow 10 silently dropping missing values/blanks

2022-11-01 Thread Lorenzo Isella (Jira)
Lorenzo Isella created ARROW-18213:
--

 Summary: [R] Arrow 10 silently dropping missing values/blanks
 Key: ARROW-18213
 URL: https://issues.apache.org/jira/browse/ARROW-18213
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Lorenzo Isella


In the example below, a single-column text file is written to disk. It 
contains some blanks, and when it is opened and collected, the blank values 
are silently dropped.

I did not test this behavior on arrow 9.0.
{code:java}



library(tidyverse)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#> timestamp

ll <- c(  "100",   "1000",  "200"  , "3000" , "50"   ,
"500", ""   ,   "Not Range")


df <- tibble(x=rep(ll, 1000))

df
#> # A tibble: 8,000 × 1
#>    x          
#>    <chr>      
#>  1 "100"  
#>  2 "1000" 
#>  3 "200"  
#>  4 "3000" 
#>  5 "50"   
#>  6 "500"  
#>  7 "" 
#>  8 "Not Range"
#>  9 "100"  
#> 10 "1000" 
#> # … with 7,990 more rows

df |> dim()
#> [1] 8000    1


write_tsv(df, "data.tsv")

data <- open_dataset("data.tsv", format="tsv",
 skip_rows=1,
 schema=schema(x=string()))

test <- data |>
collect()

test
#> # A tibble: 7,000 × 1
#>    x        
#>    <chr>    
#>  1 100  
#>  2 1000 
#>  3 200  
#>  4 3000 
#>  5 50   
#>  6 500  
#>  7 Not Range
#>  8 100  
#>  9 1000 
#> 10 200  
#> # … with 6,990 more rows

test |> dim()  ## the missing values/blanks have been dropped silently
#> [1] 7000    1




sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 11 (bullseye)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8   LC_NUMERIC=C  
#>  [3] LC_TIME=en_GB.UTF-8LC_COLLATE=en_GB.UTF-8
#>  [5] LC_MONETARY=en_GB.UTF-8LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8   LC_NAME=C 
#>  [9] LC_ADDRESS=C   LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C   
#> 
#> attached base packages:
#> [1] stats graphics  grDevices utils datasets  methods   base 
#> 
#> other attached packages:
#>  [1] arrow_10.0.0forcats_0.5.2   stringr_1.4.1   dplyr_1.0.10   
#>  [5] purrr_0.3.5 readr_2.1.3 tidyr_1.2.1 tibble_3.1.8   
#>  [9] ggplot2_3.3.6   tidyverse_1.3.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] lubridate_1.8.0 assertthat_0.2.1digest_0.6.30  
#>  [4] utf8_1.2.2  R6_2.5.1cellranger_1.1.0   
#>  [7] backports_1.4.1 reprex_2.0.2evaluate_0.17  
#> [10] httr_1.4.4  highr_0.9   pillar_1.8.1   
#> [13] rlang_1.0.6 googlesheets4_1.0.1 readxl_1.4.1   
#> [16] R.utils_2.12.1  R.oo_1.25.0 rmarkdown_2.17 
#> [19] styler_1.8.0googledrive_2.0.0   bit_4.0.4  
#> [22] munsell_0.5.0   broom_1.0.1 compiler_4.2.2 
#> [25] modelr_0.1.9xfun_0.34   pkgconfig_2.0.3
#> [28] htmltools_0.5.3 tidyselect_1.2.0fansi_1.0.3
#> [31] crayon_1.5.2tzdb_0.3.0  dbplyr_2.2.1   
#> [34] withr_2.5.0 R.methodsS3_1.8.2   grid_4.2.2 
#> [37] jsonlite_1.8.3  gtable_0.3.1lifecycle_1.0.3
#> [40] DBI_1.1.3   magrittr_2.0.3  scales_1.2.1   
#> [43] vroom_1.6.0 cli_3.4.1   stringi_1.7.8  
#> [46] fs_1.5.2xml2_1.3.3  ellipsis_0.3.2 
#> [49] generics_0.1.3  vctrs_0.5.0 tools_4.2.2
#> [52] bit64_4.0.5 R.cache_0.16.0  glue_1.6.2 
#> [55] hms_1.1.2   parallel_4.2.2  fastmap_1.1.0  
#> [58] yaml_2.3.6  colorspace_2.0-3gargle_1.2.1   
#> [61] rvest_1.0.3 knitr_1.40  haven_2.5.1
Created on 2022-11-01 with reprex v2.0.2



 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17288) [C++] Create fragment scanners for csv/parquet/orc/ipc

2022-11-01 Thread Apache Arrow JIRA Bot (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627282#comment-17627282
 ] 

Apache Arrow JIRA Bot commented on ARROW-17288:
---

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++] Create fragment scanners for csv/parquet/orc/ipc
> --
>
> Key: ARROW-17288
> URL: https://issues.apache.org/jira/browse/ARROW-17288
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>
> Once we have the basic scan node ready (with an initial implementation based 
> on in-memory fragments) then we can add over the file-format versions.  We 
> may also want to consider adding JSON support for datasets at this time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17867) [C++][FlightRPC] Expose bulk parameter binding in Flight SQL client

2022-11-01 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-17867.
--
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14266
[https://github.com/apache/arrow/pull/14266]

> [C++][FlightRPC] Expose bulk parameter binding in Flight SQL client
> ---
>
> Key: ARROW-17867
> URL: https://issues.apache.org/jira/browse/ARROW-17867
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Also fix various issues noticed as part of ARROW-17661



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18205) [C++] Substrait consumer is not converting right side references correctly on joins

2022-11-01 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace resolved ARROW-18205.
-
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14558
[https://github.com/apache/arrow/pull/14558]

> [C++] Substrait consumer is not converting right side references correctly on 
> joins
> ---
>
> Key: ARROW-18205
> URL: https://issues.apache.org/jira/browse/ARROW-18205
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The Substrait plan expresses a join condition as a logical expression like:
> {{field(0) == field(3)}} where {{0}} and {{3}} are indices into the 
> *combined* schema.  These are then passed down to Acero which expects:
> {{HashJoinNodeOptions(std::vector<FieldRef> in_left_keys, 
> std::vector<FieldRef> in_right_keys)}}
> However, {{in_left_keys}} are field references into the *left* schema and 
> {{in_right_keys}} are field references into the *right* schema.
> In other words, given the above expression ({{field(0) == field(3)}}), if the 
> schemas were:
> left:
>   key: int32
>   y: int32
>   z: int32
> right:
>   key: int32
>   x: int32
> Then {{in_left_keys}} should be {{field(0)}} (works correctly today) and 
> {{in_right_keys}} should be {{field(0)}} (today we are sending in 
> {{field(3)}}).
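
For illustration, the remapping the consumer needs amounts to subtracting the size of the left schema (numbers follow the example above; this is a sketch of the arithmetic, not the actual fix):

{code:r}
# Combined schema: [key, y, z, key, x]; the left schema has 3 fields.
n_left <- 3
combined_right_index <- 3                    # field(3) in the combined schema
right_index <- combined_right_index - n_left # field(0) in the right schema
right_index
#> [1] 0
{code}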



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18148) [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather

2022-11-01 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627236#comment-17627236
 ] 

Nicola Crane commented on ARROW-18148:
--

Thanks for taking a look at this in such close detail - interesting to see the 
nuance there around "stream format" and "file format".  I think I'm leaning 
towards option 3 as the way to go.  I'm not keen on pushing the term "IPC" on 
users who are otherwise unaware of it and don't need to be aware of it, and I 
like the API in option 3.

Perhaps for now we disregard all of this in that PR you have open, and those 
further docs updates can be made in a follow-up PR once the work to update 
these functions is done?

> [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather
> --
>
> Key: ARROW-18148
> URL: https://issues.apache.org/jira/browse/ARROW-18148
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Reporter: Stephanie Hazlitt
>Priority: Minor
>  Labels: feather
>
> Following up from [this mailing list 
> conversation|https://lists.apache.org/thread/nxncph842h8tyovxp04hrzq4y35lq4xq],
>  I am wondering if the R package should rename `read_ipc_file()` / 
> write_ipc_file()` to `read_arrow_file()`/ `write_arrow_file()`, or add an 
> additional alias for both. It might also be helpful to update the 
> documentation so that users read "Write an Arrow file (formerly known as a 
> Feather file)" rather than the current Feather-named first approach, assuming 
> there is a community decision to coalesce around the name Arrow for the file 
> format, and the project is moving on from the name Feather.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18148) [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather

2022-11-01 Thread Stephanie Hazlitt (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627230#comment-17627230
 ] 

Stephanie Hazlitt commented on ARROW-18148:
---

{quote}> "where do we talk about the nuance"
{quote}
One approach when making a big design change is to have a short vignette that 
explains the change itself (e.g. [dbplyr 2.0 did 
this|https://dbplyr.tidyverse.org/articles/backend-2.html]). What is proposed 
is not a breaking change; however, if the package moves to having 
`read_arrow(..., format = c("file", "stream", "auto"))`, it might be worth a 
101-level page on the function naming history, given there was an early version 
of `read_arrow()` which was deprecated, the marketing that needs to be done re: 
feather vs arrow naming and so on. This could also be done in the proposed 
Arrow serialization vignette with pointers, as suggested.

> [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather
> --
>
> Key: ARROW-18148
> URL: https://issues.apache.org/jira/browse/ARROW-18148
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Reporter: Stephanie Hazlitt
>Priority: Minor
>  Labels: feather
>
> Following up from [this mailing list 
> conversation|https://lists.apache.org/thread/nxncph842h8tyovxp04hrzq4y35lq4xq],
>  I am wondering if the R package should rename `read_ipc_file()` / 
> write_ipc_file()` to `read_arrow_file()`/ `write_arrow_file()`, or add an 
> additional alias for both. It might also be helpful to update the 
> documentation so that users read "Write an Arrow file (formerly known as a 
> Feather file)" rather than the current Feather-named first approach, assuming 
> there is a community decision to coalesce around the name Arrow for the file 
> format, and the project is moving on from the name Feather.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16471) RecordBuilder UnmarshalJSON does not handle extra unknown fields with complex values

2022-11-01 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol reassigned ARROW-16471:
-

Assignee: Matthew Topol

> RecordBuilder UnmarshalJSON does not handle extra unknown fields with complex 
> values
> 
>
> Key: ARROW-16471
> URL: https://issues.apache.org/jira/browse/ARROW-16471
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Affects Versions: 7.0.0
>Reporter: Phillip LeBlanc
>Assignee: Matthew Topol
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 10m
>  Remaining Estimate: 23h 50m
>
> The fix for https://issues.apache.org/jira/browse/ARROW-16456 only included 
> support for simple unknown fields with a single value.
> i.e.
> {code:javascript}
> {"region": "NY", "model": "3", "sales": 742.0, "extra": 1234}
> {code}
> However, nested objects or arrays are still not handled properly.
> {code:javascript}
> {"region": "NY", "model": "3", "sales": 742.0, "extra_array": [1234], 
> "extra_object": {"nested": ["deeply"]}}
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16471) RecordBuilder UnmarshalJSON does not handle extra unknown fields with complex values

2022-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16471:
---
Labels: pull-request-available  (was: )

> RecordBuilder UnmarshalJSON does not handle extra unknown fields with complex 
> values
> 
>
> Key: ARROW-16471
> URL: https://issues.apache.org/jira/browse/ARROW-16471
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Affects Versions: 7.0.0
>Reporter: Phillip LeBlanc
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 10m
>  Remaining Estimate: 23h 50m
>
> The fix for https://issues.apache.org/jira/browse/ARROW-16456 only included 
> support for simple unknown fields with a single value.
> i.e.
> {code:javascript}
> {"region": "NY", "model": "3", "sales": 742.0, "extra": 1234}
> {code}
> However, nested objects or arrays are still not handled properly.
> {code:javascript}
> {"region": "NY", "model": "3", "sales": 742.0, "extra_array": [1234], 
> "extra_object": {"nested": ["deeply"]}}
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND

2022-11-01 Thread Arjan van der Velde (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627216#comment-17627216
 ] 

Arjan van der Velde commented on ARROW-17374:
-

I'm running into this issue while building arrow for R from a conda 
environment. The issue seems to have been introduced by 
https://issues.apache.org/jira/browse/ARROW-16999. Versions prior to that 
change build fine.

> [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
> --
>
> Key: ARROW-17374
> URL: https://issues.apache.org/jira/browse/ARROW-17374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0, 8.0.1, 9.0.0
> Environment: Amazon Linux 2 (RHEL) - 5.10.102-99.473.amzn2.x86_64
>Reporter: Shane Brennan
>Priority: Blocker
> Attachments: build-images.out
>
>
> I've been trying to install Arrow on an R notebook within AWS SageMaker. 
> SageMaker provides Jupyter-like notebooks, with each instance running Amazon 
> Linux 2 as its OS, itself based on RHEL. 
> Trying to install a few ways, e.g., using the standard binaries, using the 
> nightly builds, setting ARROW_WITH_SNAPPY to ON and LIBARROW_MINIMAL all 
> still result in the following error. 
> {noformat}
> x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared 
> -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -Wl,-O2 -Wl,--sort-common 
> -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags 
> -Wl,--gc-sections -Wl,--allow-shlib-undefined 
> -Wl,-rpath,/home/ec2-user/anaconda3/envs/R/lib 
> -Wl,-rpath-link,/home/ec2-user/anaconda3/envs/R/lib 
> -L/home/ec2-user/anaconda3/envs/R/lib -o arrow.so RTasks.o altrep.o array.o 
> array_to_vector.o arraydata.o arrowExports.o bridge.o buffer.o chunkedarray.o 
> compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o 
> expression.o extension-impl.o feather.o field.o filesystem.o imports.o io.o 
> json.o memorypool.o message.o parquet.o r_to_arrow.o recordbatch.o 
> recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o 
> schema.o symbols.o table.o threadpool.o type_infer.o 
> -L/tmp/Rtmpuh87oc/R.INSTALL67114493a3de/arrow/libarrow/arrow-9.0.0.20220809/lib
>  -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies -lz 
> SNAPPY_LIB-NOTFOUND /home/ec2-user/anaconda3/envs/R/lib/libbz2.so -pthread 
> -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto 
> -lcurl -lssl -lcrypto -lcurl -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -lR
> x86_64-conda-linux-gnu-c++: error: SNAPPY_LIB-NOTFOUND: No such file or 
> directory
> make: *** [/home/ec2-user/anaconda3/envs/R/lib/R/share/make/shlib.mk:10: 
> arrow.so] Error 1{noformat}
> Snappy is installed on the systems, and both shared object (.so) and cmake 
> files are there, where I've tried setting the system env variables Snappy_DIR 
> and Snappy_LIB to point at them, but to no avail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18177) [Go] Implement Add/Sub for Temporal Types

2022-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18177:
---
Labels: pull-request-available  (was: )

> [Go] Implement Add/Sub for Temporal Types
> -
>
> Key: ARROW-18177
> URL: https://issues.apache.org/jira/browse/ARROW-18177
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17899) [Go] Add support for Decimal types in go/arrow/csv/reader.go

2022-11-01 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-17899.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14504
[https://github.com/apache/arrow/pull/14504]

> [Go] Add support for Decimal types in go/arrow/csv/reader.go
> 
>
> Key: ARROW-17899
> URL: https://issues.apache.org/jira/browse/ARROW-17899
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Mitchell Devenport
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18212) [C++] NumericBuilder::Reset() doesn't reset all members

2022-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18212:
---
Labels: pull-request-available  (was: )

> [C++] NumericBuilder::Reset() doesn't reset all members
> ---
>
> Key: ARROW-18212
> URL: https://issues.apache.org/jira/browse/ARROW-18212
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jin Shang
>Assignee: Jin Shang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18211) [C++] NumericBuilder::Reset() doesn't reset all members

2022-11-01 Thread Jin Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jin Shang resolved ARROW-18211.
---
Resolution: Duplicate

> [C++] NumericBuilder::Reset() doesn't reset all members
> ---
>
> Key: ARROW-18211
> URL: https://issues.apache.org/jira/browse/ARROW-18211
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jin Shang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-18211) [C++] NumericBuilder::Reset() doesn't reset all members

2022-11-01 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot closed ARROW-18211.
-

> [C++] NumericBuilder::Reset() doesn't reset all members
> ---
>
> Key: ARROW-18211
> URL: https://issues.apache.org/jira/browse/ARROW-18211
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jin Shang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18185) [C++][Compute] Support KEEP_NULL option for compute::Filter

2022-11-01 Thread Jin Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jin Shang reassigned ARROW-18185:
-

Assignee: Jin Shang

> [C++][Compute] Support KEEP_NULL option for compute::Filter
> ---
>
> Key: ARROW-18185
> URL: https://issues.apache.org/jira/browse/ARROW-18185
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Jin Shang
>Assignee: Jin Shang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The current Filter implementation always drops the filtered values. In some 
> use cases, it's desirable for the output array to have the same size as the 
> input array, so I added a new option FilterOptions::KEEP_NULL where the 
> filtered-out values are kept as nulls.
> For example, with input [1, 2, 3] and filter [true, false, true], the current 
> implementation will output [1, 3], and with the new option it will output 
> [1, null, 3].
> This option is simpler to implement, since we only need to construct a new 
> validity bitmap and reuse the input buffers and child arrays (except for 
> dense union arrays, which don't have validity bitmaps).
> It is also faster to filter with FilterOptions::KEEP_NULL in most cases, 
> according to the benchmark results, so users can choose this option for 
> better performance when dropping the filtered values is not required.
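
For illustration, the proposed KEEP_NULL semantics expressed in plain R (a sketch of the behaviour only, not of the implementation):

{code:r}
values <- c(1, 2, 3)
mask <- c(TRUE, FALSE, TRUE)

values[mask]             # current DROP behaviour: 1 3
ifelse(mask, values, NA) # proposed KEEP_NULL behaviour: 1 NA 3
{code}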



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18212) [C++] NumericBuilder::Reset() doesn't reset all members

2022-11-01 Thread Jin Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jin Shang reassigned ARROW-18212:
-

Assignee: Jin Shang

> [C++] NumericBuilder::Reset() doesn't reset all members
> ---
>
> Key: ARROW-18212
> URL: https://issues.apache.org/jira/browse/ARROW-18212
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jin Shang
>Assignee: Jin Shang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18211) [C++] NumericBuilder::Reset() doesn't reset all members

2022-11-01 Thread Jin Shang (Jira)
Jin Shang created ARROW-18211:
-

 Summary: [C++] NumericBuilder::Reset() doesn't reset all members
 Key: ARROW-18211
 URL: https://issues.apache.org/jira/browse/ARROW-18211
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Jin Shang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18161) Ruby Arrow Tables can have buffers get GC'ed

2022-11-01 Thread Noah Horton (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noah Horton updated ARROW-18161:

Summary: Ruby Arrow Tables can have buffers get GC'ed  (was: Reading Arrow 
table causes mutations)

> Ruby Arrow Tables can have buffers get GC'ed
> 
>
> Key: ARROW-18161
> URL: https://issues.apache.org/jira/browse/ARROW-18161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 9.0.0
> Environment: Ruby 3.1.2
>Reporter: Noah Horton
>Assignee: Kouhei Sutou
>Priority: Major
>
> Given an Arrow::Table with several columns "X"
>  
> {code:ruby}
> # Rails console outputs
> 3.1.2 :107 > x.schema
>  => 
> # dates: date32[day]                             
> expected_values: double>                       
> 3.1.2 :108 > x.schema
>  => 
> # dates: date32[day]                             
> expected_values: double>                       
> 3.1.2 :109 >  {code}
> Note that the object and pointer have both changed values.
> But the far bigger issue is that repeated reads from it will cause different 
> results:
> {code:ruby}
> 3.1.2 :097 > x[1][0]
>  => Sun, 22 Aug 2021 
> 3.1.2 :098 > x[1][1]
>  => nil 
> 3.1.2 :099 > x[1][0]
>  => nil {code}
> I have a lot of issues like this - when I have done these types of read 
> operations, I get the original table with the data in the columns all 
> shuffled around or deleted. 
> I do ingest the data slightly oddly in the first place as it comes in over 
> GRPC and I am using Arrow::Buffer to read it from the GRPC and then passing 
> that into Arrow::Table.load. But I would not expect that once it was in 
> Arrow::Table that I could do anything to permute it unintentionally.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18212) [C++] NumericBuilder::Reset() doesn't reset all members

2022-11-01 Thread Jin Shang (Jira)
Jin Shang created ARROW-18212:
-

 Summary: [C++] NumericBuilder::Reset() doesn't reset all members
 Key: ARROW-18212
 URL: https://issues.apache.org/jira/browse/ARROW-18212
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Jin Shang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18161) Reading Arrow table causes mutations

2022-11-01 Thread Noah Horton (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noah Horton updated ARROW-18161:

Summary: Reading Arrow table causes mutations  (was: Reading error table 
causes mutations)

> Reading Arrow table causes mutations
> 
>
> Key: ARROW-18161
> URL: https://issues.apache.org/jira/browse/ARROW-18161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 9.0.0
> Environment: Ruby 3.1.2
>Reporter: Noah Horton
>Assignee: Kouhei Sutou
>Priority: Major
>
> Given an Arrow::Table with several columns "X"
>  
> {code:ruby}
> # Rails console outputs
> 3.1.2 :107 > x.schema
>  => 
> # dates: date32[day]                             
> expected_values: double>                       
> 3.1.2 :108 > x.schema
>  => 
> # dates: date32[day]                             
> expected_values: double>                       
> 3.1.2 :109 >  {code}
> Note that the object and pointer have both changed values.
> But the far bigger issue is that repeated reads from it will cause different 
> results:
> {code:ruby}
> 3.1.2 :097 > x[1][0]
>  => Sun, 22 Aug 2021 
> 3.1.2 :098 > x[1][1]
>  => nil 
> 3.1.2 :099 > x[1][0]
>  => nil {code}
> I have a lot of issues like this - when I have done these types of read 
> operations, I get the original table with the data in the columns all 
> shuffled around or deleted. 
> I do ingest the data slightly oddly in the first place as it comes in over 
> GRPC and I am using Arrow::Buffer to read it from the GRPC and then passing 
> that into Arrow::Table.load. But I would not expect that once it was in 
> Arrow::Table that I could do anything to permute it unintentionally.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18161) Reading error table causes mutations

2022-11-01 Thread Noah Horton (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627180#comment-17627180
 ] 

Noah Horton commented on ARROW-18161:
-

The workaround appears to have worked - thanks. Leaving the ticket open, as 
the deeper fix would help folks in the future.

> Reading error table causes mutations
> 
>
> Key: ARROW-18161
> URL: https://issues.apache.org/jira/browse/ARROW-18161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 9.0.0
> Environment: Ruby 3.1.2
>Reporter: Noah Horton
>Assignee: Kouhei Sutou
>Priority: Major
>
> Given an Arrow::Table with several columns "X"
>  
> {code:ruby}
> # Rails console outputs
> 3.1.2 :107 > x.schema
>  => 
> # dates: date32[day]                             
> expected_values: double>                       
> 3.1.2 :108 > x.schema
>  => 
> # dates: date32[day]                             
> expected_values: double>                       
> 3.1.2 :109 >  {code}
> Note that the object and pointer have both changed values.
> But the far bigger issue is that repeated reads from it will cause different 
> results:
> {code:ruby}
> 3.1.2 :097 > x[1][0]
>  => Sun, 22 Aug 2021 
> 3.1.2 :098 > x[1][1]
>  => nil 
> 3.1.2 :099 > x[1][0]
>  => nil {code}
> I have a lot of issues like this - when I have done these types of read 
> operations, I get the original table with the data in the columns all 
> shuffled around or deleted. 
> I do ingest the data slightly oddly in the first place as it comes in over 
> GRPC and I am using Arrow::Buffer to read it from the GRPC and then passing 
> that into Arrow::Table.load. But I would not expect that once it was in 
> Arrow::Table that I could do anything to permute it unintentionally.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18183) cpp-micro benchmarks are failing on mac arm machine

2022-11-01 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627172#comment-17627172
 ] 

Weston Pace commented on ARROW-18183:
-

Thank you.  I will look at this today.

> cpp-micro benchmarks are failing on mac arm machine
> ---
>
> Key: ARROW-18183
> URL: https://issues.apache.org/jira/browse/ARROW-18183
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Benchmarking, C++
>Reporter: Elena Henderson
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18207) Rubygems not updating in concert with majors

2022-11-01 Thread Noah Horton (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627145#comment-17627145
 ] 

Noah Horton commented on ARROW-18207:
-

I want to really call out that I appreciate the work of the team, and don't 
like throwing on opinions when I am not doing the work.  That said... ;)

Effectively we get to choose between holding arrow releases on this stuff 
(unlikely), breaking Windows deployments of arrow apps on Ruby, or breaking 
Docker deployments on Ruby. I think Docker should win. I cannot imagine that 
many people are deploying arrow Ruby apps on Windows without virtualization to 
Linux.

> Rubygems not updating in concert with majors
> 
>
> Key: ARROW-18207
> URL: https://issues.apache.org/jira/browse/ARROW-18207
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 10.0.0
>Reporter: Noah Horton
>Assignee: Kouhei Sutou
>Priority: Major
>
> 10.0.0 just released, meaning that all install scripts that use the 
> 'latest' tag are getting it.
> Yet rubygems.org is still serving the 9.0.0 version a week after 10.0.0 
> was released.
> The build scripts need to start updating rubygems.org automatically, or guide 
> users to a bundler config like 
> {code:ruby}
> gem "red-arrow", github: "apache/arrow", glob: "ruby/red-arrow/*.gemspec", 
> require: "arrow", tag: 'apache-arrow-10.0.0'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18202) [R] gsub does not work properly

2022-11-01 Thread Dewey Dunnington (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627139#comment-17627139
 ] 

Dewey Dunnington commented on ARROW-18202:
--

Thank you for reporting! This does sound like invalid behaviour. Here is a 
slightly more minimal reprex with Arrow 10.0.0:

{code:R}
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for 
more information.
library(dplyr, warn.conflicts = FALSE)

vals <- c("100",   "1000",  "200"  , "3000" , "50",
  "500", "", "Not Range")

record_batch(vals = vals) |> 
  mutate(vals2 = gsub("^$", "0", vals)) |> 
  collect()
#> # A tibble: 8 × 2
#>   vals        vals2      
#>   <chr>       <chr>      
#> 1 "100"   "100"  
#> 2 "1000"  "1000" 
#> 3 "200"   "200"  
#> 4 "3000"  "3000" 
#> 5 "50""50"   
#> 6 "500"   "500"  
#> 7 ""  "" 
#> 8 "Not Range" "Not Range"

tibble::tibble(vals = vals) |> 
  mutate(vals2 = gsub("^$", "0", vals)) |> 
  collect()
#> # A tibble: 8 × 2
#>   vals        vals2    
#>   <chr>       <chr>    
#> 1 "100"   100  
#> 2 "1000"  1000 
#> 3 "200"   200  
#> 4 "3000"  3000 
#> 5 "50"50   
#> 6 "500"   500  
#> 7 ""  0
#> 8 "Not Range" Not Range
{code}
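
Until the Arrow translation is fixed, one workaround is to collect first so that gsub() runs in R rather than being translated to an Arrow kernel (a sketch reusing `vals` from above):

{code:R}
record_batch(vals = vals) |>
  collect() |>
  mutate(vals2 = gsub("^$", "0", vals))
{code}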


> [R] gsub does not work properly
> ---
>
> Key: ARROW-18202
> URL: https://issues.apache.org/jira/browse/ARROW-18202
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Lorenzo Isella
>Priority: Major
>
> Hello,
> I think there is a problem with arrow 10.0 and R. I did not have this issue 
> with arrow 9.0.
> Could you please have a look?
> Many thanks
>  
> {code:r}
> library(tidyverse)
> library(arrow)
> ll <- c(      "100",   "1000",  "200"  , "3000" , "50"   ,
>         "500", ""   ,   "Not Range")
> df <- tibble(x=rep(ll, 1000), y=seq(8000))
> write_tsv(df, "data.tsv")
> data <- open_dataset("data.tsv", format="tsv",
>                      skip_rows=1,
>                      schema=schema(x=string(),
>                      y=double())
> )
> test <- data |>
>     collect()
> ###I want to replace the "" with "0". I believe this worked with arrow 9.0
> df2 <- data |>
>     mutate(x=gsub("^$","0",x) ) |>
>     collect()
> df2 ### now I did not modify the  "" entries in x
> #> # A tibble: 8,000 × 2
> #>    x               y
> #>    <chr>       <dbl>
> #>  1 "100"       1
> #>  2 "1000"      2
> #>  3 "200"       3
> #>  4 "3000"      4
> #>  5 "50"        5
> #>  6 "500"       6
> #>  7 ""              7
> #>  8 "Not Range"     8
> #>  9 "100"       9
> #> 10 "1000"     10
> #> # … with 7,990 more rows
>  
> df3 <- df |>
>     mutate(x=gsub("^$","0",x) )
> df3  ## and this is fine
> #> # A tibble: 8,000 × 2
> #>    x             y
> #>    <chr>     <dbl>
> #>  1 100       1
> #>  2 1000      2
> #>  3 200       3
> #>  4 3000      4
> #>  5 50        5
> #>  6 500       6
> #>  7 0             7
> #>  8 Not Range     8
> #>  9 100       9
> #> 10 1000     10
> #> # … with 7,990 more rows
> ## How to fix this...I believe this issue did not arise with arrow 9.0.
> sessionInfo()
> #> R version 4.2.1 (2022-06-23)
> #> Platform: x86_64-pc-linux-gnu (64-bit)
> #> Running under: Debian GNU/Linux 11 (bullseye)
> #> 
> #> Matrix products: default
> #> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
> #> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
> #> 
> #> locale:
> #>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
> #>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
> #>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
> #>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
> #>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> #> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
> #> 
> #> attached base packages:
> #> [1] stats     graphics  grDevices utils     datasets  methods   base     
> #> 
> #> other attached packages:
> #>  [1] arrow_10.0.0    forcats_0.5.2   stringr_1.4.1   dplyr_1.0.10   
> #>  [5] purrr_0.3.5     readr_2.1.3     tidyr_1.2.1     tibble_3.1.8   
> #>  [9] ggplot2_3.3.6   tidyverse_1.3.2
> #> 
> #> loaded via a namespace (and not attached):
> #>  [1] lubridate_1.8.0     assertthat_0.2.1    digest_0.6.30      
> #>  [4] utf8_1.2.2          R6_2.5.1            cellranger_1.1.0   
> #>  [7] backports_1.4.1     reprex_2.0.2        evaluate_0.17      
> #> [10] httr_1.4.4          highr_0.9           pillar_1.8.1       
> #> [13] rlang_1.0.6         googlesheets4_1.0.1 readxl_1.4.1       
> #> [16] R.utils_2.12.1      R.oo_1.25.0         rmarkdown_2.17     
> #> [19] styler_1.8.0        googledrive_2.0.0   

[jira] [Commented] (ARROW-18204) [R] Allow setting field metadata

2022-11-01 Thread Dewey Dunnington (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627138#comment-17627138
 ] 

Dewey Dunnington commented on ARROW-18204:
--

We should totally support this!

A workaround in case you need it:

{code:R}
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for 
more information.
# remotes::install_github("paleolimbot/narrow")
library(narrow)

set_field_metadata <- function(field, ...) {
  vals <- rlang::list2(...)
  cschema <- narrow::as_narrow_schema(field)
  current_vals <- cschema$metadata
  keys <- union(names(vals), names(current_vals))
  cschema$metadata <- c(vals, current_vals)[keys]
  arrow::Field$import_from_c(cschema)
}

field_metadata <- function(field) {
  narrow::as_narrow_schema(field)$metadata
}

(f <- field("some name", int32()))
#> Field
#> some name: int32
f_meta <- set_field_metadata(f, some_key = "some value")

field_metadata(f_meta)
#> $some_key
#> [1] "some value"
{code}


> [R] Allow setting field metadata
> 
>
> Key: ARROW-18204
> URL: https://issues.apache.org/jira/browse/ARROW-18204
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 10.0.0
>Reporter: Will Jones
>Priority: Major
>
> Currently, can't create a {{Field}} with metadata, which makes it hard to 
> create tests regarding field metadata. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17731) [Website] Add blog post about Flight SQL JDBC driver

2022-11-01 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-17731.
--
Resolution: Fixed

Issue resolved by pull request 236
[https://github.com/apache/arrow-site/pull/236]

> [Website] Add blog post about Flight SQL JDBC driver
> 
>
> Key: ARROW-17731
> URL: https://issues.apache.org/jira/browse/ARROW-17731
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Website
>Reporter: David Li
>Assignee: David Li
>Priority: Major
> Fix For: 11.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18183) cpp-micro benchmarks are failing on mac arm machine

2022-11-01 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627037#comment-17627037
 ] 

Yibo Cai commented on ARROW-18183:
--

I tried setting ARROW_DEFAULT_MEMORY_POOL to "jemalloc" and "system"; the same 
error happens.

> cpp-micro benchmarks are failing on mac arm machine
> ---
>
> Key: ARROW-18183
> URL: https://issues.apache.org/jira/browse/ARROW-18183
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Benchmarking, C++
>Reporter: Elena Henderson
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18183) cpp-micro benchmarks are failing on mac arm machine

2022-11-01 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627030#comment-17627030
 ] 

Yibo Cai commented on ARROW-18183:
--

Tested on M1, all "arrow-dataset-scanner-benchmark/scan_alg=1" tests failed 
with SIGBUS. "scan_alg=0" tests are okay.
Stack depth approaches 4000 in the backtrace. It looks like there is a call 
loop among \{future,async_util\}.\{h,cc\}.
ASAN identified a stack overflow; logs attached.
cc [~westonpace]

{code:bash}
# all scan_alg:0 tests are okay, all scan_alg:1 tests cause sigbus
% debug/arrow-dataset-scanner-benchmark --benchmark_filter=".*scan_alg:1.*" 

/Users/linux/cyb/arrow/cpp/src/arrow/memory_pool.cc:113: Unsupported backend 
'mimalloc' specified in ARROW_DEFAULT_MEMORY_POOL (supported backends are 
'jemalloc', 'system')
Unable to determine clock rate from sysctl: hw.cpufrequency: No such file or 
directory
This does not affect benchmark measurements, only the metadata output.
2022-11-01T17:02:15+08:00
Running debug/arrow-dataset-scanner-benchmark
Run on (8 X 24.2408 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x8)
Load Average: 2.06, 2.81, 2.72
AddressSanitizer:DEADLYSIGNAL
=
==75674==ERROR: AddressSanitizer: stack-overflow on address 0x00016b9b3fc0 (pc 
0x000106b4b3b4 bp 0x000106b4b3a0 sp 0x00016b9b3fa0 T1)
#0 0x106b4b3b4 in __sanitizer::StackDepotBase<__sanitizer::StackDepotNode, 
1, 20>::Put(__sanitizer::StackTrace, bool*)+0x4 
(libclang_rt.asan_osx_dynamic.dylib:arm64+0x5f3b4)

SUMMARY: AddressSanitizer: stack-overflow 
(libclang_rt.asan_osx_dynamic.dylib:arm64+0x5f3b4) in 
__sanitizer::StackDepotBase<__sanitizer::StackDepotNode, 1, 
20>::Put(__sanitizer::StackTrace, bool*)+0x4
Thread T1 created by T0 here:
#0 0x106b2680c in wrap_pthread_create+0x50 
(libclang_rt.asan_osx_dynamic.dylib:arm64+0x3a80c)
#1 0x113b7a408 in std::__1::__libcpp_thread_create(_opaque_pthread_t**, 
void* (*)(void*), void*) __threading_support:375
#2 0x113b7a128 in 
std::__1::thread::thread(arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3&&) 
thread:309
#3 0x113b67e94 in 
std::__1::thread::thread(arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3&&) 
thread:301
#4 0x113b66794 in arrow::internal::ThreadPool::LaunchWorkersUnlocked(int) 
thread_pool.cc:412
#5 0x113b68444 in 
arrow::internal::ThreadPool::SpawnReal(arrow::internal::TaskHints, 
arrow::internal::FnOnce, arrow::StopToken, 
arrow::internal::FnOnce&&) thread_pool.cc:448
#6 0x10488dfd8 in 
arrow::Result
 > > > > arrow::internal::Executor::Submit
 > > > >(arrow::internal::TaskHints, arrow::StopToken, 
arrow::dataset::(anonymous namespace)::GetFragments(arrow::dataset::Dataset*, 
arrow::compute::Expression)::$_0&&) thread_pool.h:167
#7 0x10488be74 in 
arrow::Result
 > > > > arrow::internal::Executor::Submit
 > > > >(arrow::dataset::(anonymous 
namespace)::GetFragments(arrow::dataset::Dataset*, 
arrow::compute::Expression)::$_0&&) thread_pool.h:193
#8 0x10488ac0c in arrow::dataset::(anonymous 
namespace)::GetFragments(arrow::dataset::Dataset*, arrow::compute::Expression) 
scan_node.cc:64
#9 0x10488a010 in arrow::dataset::(anonymous 
namespace)::ScanNode::StartProducing() scan_node.cc:318
#10 0x113fc43e0 in arrow::compute::(anonymous 
namespace)::ExecPlanImpl::StartProducing() exec_plan.cc:183
#11 0x113fc362c in arrow::compute::ExecPlan::StartProducing() 
exec_plan.cc:400
#12 0x104462260 in arrow::dataset::MinimalEndToEndScan(unsigned long, 
unsigned long, std::__1::basic_string, 
std::__1::allocator > const&, 
std::__1::function
 > (unsigned long, unsigned long)>) scanner_benchmark.cc:159
#13 0x104468ebc in arrow::dataset::MinimalEndToEndBench(benchmark::State&) 
scanner_benchmark.cc:272
#14 0x1055dbc8c in benchmark::internal::BenchmarkInstance::Run(long long, 
int, benchmark::internal::ThreadTimer*, benchmark::internal::ThreadManager*, 
benchmark::internal::PerfCountersMeasurement*) const+0x44 
(libbenchmark.1.7.0.dylib:arm64+0xbc8c)
#15 0x1055ed708 in benchmark::internal::(anonymous 
namespace)::RunInThread(benchmark::internal::BenchmarkInstance const*, long 
long, int, benchmark::internal::ThreadManager*, 
benchmark::internal::PerfCountersMeasurement*)+0x58 
(libbenchmark.1.7.0.dylib:arm64+0x1d708)
#16 0x1055ed2c8 in 
benchmark::internal::BenchmarkRunner::DoNIterations()+0x2c0 
(libbenchmark.1.7.0.dylib:arm64+0x1d2c8)
#17 0x1055edfec in 
benchmark::internal::BenchmarkRunner::DoOneRepetition()+0xb0 
(libbenchmark.1.7.0.dylib:arm64+0x1dfec)
#18 0x1055d4fb8 in 
benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, 
benchmark::BenchmarkReporter*, std::__1::basic_string, std::__1::allocator >)+0x9f0 
(libbenchmark.1.7.0.dylib:arm64+0x4fb8)
#19 0x1055d4564 in benchmark::RunSpecifiedBenchmarks()+0x3c 

[jira] [Resolved] (ARROW-18162) [C++] Add Arm SVE compiler options

2022-11-01 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai resolved ARROW-18162.
--
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14515
[https://github.com/apache/arrow/pull/14515]

> [C++] Add Arm SVE compiler options
> --
>
> Key: ARROW-18162
> URL: https://issues.apache.org/jira/browse/ARROW-18162
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> {{xsimd}} 9.0+ supports Arm SVE (fixed size). Some additional compiler 
> options are required to enable SVE.
> Per my test on Amazon Graviton3 (SVE-256), SVE256 performs much better than 
> NEON in some cases. E.g., the utf8 benchmark {{ValidateLargeAscii}} improves 
> from *38.6* (NEON) to *51.5* (SVE256) GB/s.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-18210) [C++][Parquet] Skip check in StreamWriter

2022-11-01 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627001#comment-17627001
 ] 

Antoine Pitrou edited comment on ARROW-18210 at 11/1/22 8:21 AM:
-

I see. I don't think you can expect excellent performance from 
{{{}StreamWriter{}}}. Parquet is a columnar format, so you should feed the data 
column-wise rather than row-wise. Take a look at the {{TypedColumnWriter}} 
class and ensure you write data in batches.


was (Author: pitrou):
I see. I don't think you can expect excellent performance from StreamWriter. 
Parquet is a columnar format, so you should feed the data column-wise rather 
than row-wise. Take a look at the {{TypedColumnWriter}} and ensure you write 
data in batches.

> [C++][Parquet] Skip check in StreamWriter
> -
>
> Key: ARROW-18210
> URL: https://issues.apache.org/jira/browse/ARROW-18210
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Affects Versions: 10.0.0
>Reporter: Madhur
>Priority: Major
>
> Currently StreamWriter is slower only because of the checking of columns; if 
> we allow a customization option (maybe a ctor arg) to skip the check, then 
> StreamWriter can be more efficient?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18210) [C++][Parquet] Skip check in StreamWriter

2022-11-01 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627001#comment-17627001
 ] 

Antoine Pitrou commented on ARROW-18210:


I see. I don't think you can expect excellent performance from StreamWriter. 
Parquet is a columnar format, so you should feed the data column-wise rather 
than row-wise. Take a look at the {{TypedColumnWriter}} and ensure you write 
data in batches.

> [C++][Parquet] Skip check in StreamWriter
> -
>
> Key: ARROW-18210
> URL: https://issues.apache.org/jira/browse/ARROW-18210
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Affects Versions: 10.0.0
>Reporter: Madhur
>Priority: Major
>
> Currently StreamWriter is slower only because of the checking of columns; if 
> we allow a customization option (maybe a ctor arg) to skip the check, then 
> StreamWriter can be more efficient?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18210) [C++][Parquet] Skip check in StreamWriter

2022-11-01 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18210:
---
Summary: [C++][Parquet] Skip check in StreamWriter  (was: Skip check in 
StreamWriter)

> [C++][Parquet] Skip check in StreamWriter
> -
>
> Key: ARROW-18210
> URL: https://issues.apache.org/jira/browse/ARROW-18210
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 10.0.0
>Reporter: Madhur
>Priority: Major
>
> Currently StreamWriter is slower only because of the checking of columns; if 
> we allow a customization option (maybe a ctor arg) to skip the check, then 
> StreamWriter can be more efficient?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18210) [C++][Parquet] Skip check in StreamWriter

2022-11-01 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18210:
---
Component/s: Parquet

> [C++][Parquet] Skip check in StreamWriter
> -
>
> Key: ARROW-18210
> URL: https://issues.apache.org/jira/browse/ARROW-18210
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Affects Versions: 10.0.0
>Reporter: Madhur
>Priority: Major
>
> Currently StreamWriter is slower only because of the checking of columns; if 
> we allow a customization option (maybe a ctor arg) to skip the check, then 
> StreamWriter can be more efficient?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18210) Skip check in StreamWriter

2022-11-01 Thread Madhur (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627000#comment-17627000
 ] 

Madhur commented on ARROW-18210:


Yes 
https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet12StreamWriterE

> Skip check in StreamWriter
> --
>
> Key: ARROW-18210
> URL: https://issues.apache.org/jira/browse/ARROW-18210
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 10.0.0
>Reporter: Madhur
>Priority: Major
>
> Currently StreamWriter is slower only because of the checking of columns; if 
> we allow a customization option (maybe a ctor arg) to skip the check, then 
> StreamWriter can be more efficient?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18210) Skip check in StreamWriter

2022-11-01 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17626998#comment-17626998
 ] 

Antoine Pitrou commented on ARROW-18210:


Your issue description is not clear: is it about the Parquet StreamWriter?

> Skip check in StreamWriter
> --
>
> Key: ARROW-18210
> URL: https://issues.apache.org/jira/browse/ARROW-18210
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 10.0.0
>Reporter: Madhur
>Priority: Major
>
> Currently StreamWriter is slower only because of the checking of columns; if 
> we allow a customization option (maybe a ctor arg) to skip the check, then 
> StreamWriter can be more efficient?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17239) [C++] Calculate output type from aggregate to convert arrow aggregate to substrait

2022-11-01 Thread Vibhatha Lakmal Abeykoon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vibhatha Lakmal Abeykoon reassigned ARROW-17239:


Assignee: Vibhatha Lakmal Abeykoon

> [C++] Calculate output type from aggregate to convert arrow aggregate to 
> substrait
> --
>
> Key: ARROW-17239
> URL: https://issues.apache.org/jira/browse/ARROW-17239
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>  Labels: substrait
> Fix For: 11.0.0
>
>
> I am adding support for mapping to/from Arrow aggregates and Substrait 
> aggregates in ARROW-15582.  However, the Arrow-Substrait direction is 
> currently blocked because the Substrait plan needs to know the output type of 
> an aggregate and there is no easy way to determine that from the Arrow 
> information we have.
> We should be able to get this information from the function registry, but as 
> far as I can tell the conversion routines do not have access to the function 
> registry.  I'm not sure if the best solution is to pass the function registry 
> into ToProto or to add the output type to the aggregate in Arrow.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18205) [C++] Substrait consumer is not converting right side references correctly on joins

2022-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18205:
---
Labels: pull-request-available  (was: )

> [C++] Substrait consumer is not converting right side references correctly on 
> joins
> ---
>
> Key: ARROW-18205
> URL: https://issues.apache.org/jira/browse/ARROW-18205
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Substrait plan expresses a join condition as a logical expression like:
> {{field(0) == field(3)}} where {{0}} and {{3}} are indices into the 
> *combined* schema.  These are then passed down to Acero which expects:
> {{HashJoinNodeOptions(std::vector<FieldRef> in_left_keys, 
> std::vector<FieldRef> in_right_keys)}}
> However, {{in_left_keys}} are field references into the *left* schema and 
> {{in_right_keys}} are field references into the *right* schema.
> In other words, given the above expression ({{field(0) == field(3)}}), if the 
> schemas were:
> left:
>   key: int32
>   y: int32
>   z: int32
> right:
>   key: int32
>   x: int32
> Then {{in_left_keys}} should be {{field(0)}} (works correctly today) and 
> {{in_right_keys}} should be {{field(0)}} (today we are sending in 
> {{field(3)}}).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-18148) [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather

2022-11-01 Thread Danielle Navarro (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17626980#comment-17626980
 ] 

Danielle Navarro edited comment on ARROW-18148 at 11/1/22 6:56 AM:
---

Tentatively offering some thoughts :-)

If I'm understanding this properly, we have two problems:

- The first problem is that the history of serializing Arrow objects is messy 
and has left us with three names that people might recognize: Feather, IPC, 
Arrow. We'd like users to transition to using "Arrow" as the preferred name, 
and to give them an API that reflects that terminology.

- The second problem is that we use "file format" and "stream format" to mean 
something subtly different from "files" and "streams". The file format wraps 
the stream format with magic numbers at the start and end, with a footer 
written after the stream. These two formats aren't *inherently* tied to files 
and streams. The user can write a "stream formatted" file if they want (i.e., 
no magic numbers, no footers) and they can also send a "file formatted" 
serialization (i.e., with the magic number and footer) to an output stream if 
they want to. The current API allows this, but users would be forgiven for 
missing this subtle detail!

h2. Option 1: Don't change the API, only the docs

This option would leave `read_ipc_file()`, `write_ipc_file()`, 
`read_ipc_stream()`, and `write_ipc_stream()` as the four user-facing functions 
(treating `read_feather()` and `write_feather()` as soft-deprecated, and 
leaving `write_to_raw()` untouched).

The only thing that would change in this version is that we would consistently 
refer to "Arrow IPC file" and "Arrow IPC stream" everywhere (i.e., never 
truncating it to "IPC"). Language around "feather" would be relegated to a 
secondary position (e.g., "formerly known as Feather"), and we would emphasize 
that the preferred file extension for V2 feather files is `.arrow`. 

h2. Option 2: New names for the existing four functions

This option would replace `read_ipc_file()` with `read_arrow_file()`, 
`read_ipc_stream()` with `read_arrow_stream()` and so on. The `ipc` and 
`feather` versions would be soft-deprecated.  

The documentation would be updated accordingly. We'd now refer to "Arrow file" 
and "Arrow stream" everywhere. As with option 1 we'd use language like 
"formerly known as Feather" to explain the history (perhaps linking back to the 
old repo just to highlight the origin). We would also, where relevant, note 
that "Arrow stream" is a conventional name for the "Arrow inter-process 
communication (IPC) streaming format", as a way of (a) explaining the ipc 
versions of the functions, and (b) helping users find the relevant part of the 
Arrow specification.

h2. Option 3: Reduce API to two functions

This option would have only two functions, `read_arrow()` and `write_arrow()`. 
Both functions would have a new argument called `format` (or something 
similar). Users could specify either `format = "stream"` or `format = "file"`. 
From a documentation perspective this would require a little more finessing: we 
might have to separate the help topics for the new API and older versions 
of the API to avoid mess. But it might have the advantage of making it clearer to 
users that the terms `"stream"` and `"file"` don't actually refer to *where* 
you're writing the data, but how you *encode* the data when you write it. 
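
For concreteness, a hypothetical sketch of what an Option 3 wrapper might look like (names, defaults, and the body are illustrative only, not a proposed implementation; the "auto" branch assumes detection via the IPC file format's leading "ARROW1" magic bytes):

{code:r}
# Hypothetical Option 3 API, for illustration only.
read_arrow <- function(file, format = c("auto", "file", "stream"), ...) {
  format <- match.arg(format)
  if (format == "auto") {
    # The IPC file format begins with the magic bytes "ARROW1".
    con <- base::file(file, "rb")
    on.exit(close(con))
    magic <- readBin(con, "raw", n = 6)
    format <- if (identical(magic, charToRaw("ARROW1"))) "file" else "stream"
  }
  switch(format,
    file = read_ipc_file(file, ...),
    stream = read_ipc_stream(file, ...)
  )
}
{code}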

h2. Preferences?

I am not sure what I prefer, but I can at least say what I think the strengths 
and weaknesses are for each proposal:

Option 3 seems like the cleanest in terms of making the Arrow/Feather/IPC 
functions feel analogous to the other functions in the read/write API: 
`read_arrow()` and `write_arrow()` feel closely aligned with `read_parquet()` 
and `write_parquet()`. It makes very clear that these functions are designed to 
read and write Arrow objects in an "Arrow-like" way. However, it does have the 
disadvantage that the encoding vs destination complexity gets pushed into the 
arguments: users will need to understand why there is a `format` argument that is 
distinct from the `file`/`sink` argument, and the documentation will need to 
explain that. 

Option 2 has the advantage of preserving the same "four-function structure" as 
the existing serialization API, but it does come at the expense of being a 
little misleading to anyone who doesn't understand that the function names 
refer to the encoding not the destination: `write_arrow_stream()` can in fact 
write to a file, and `write_arrow_file()` can write to a stream. That's 
potentially even more confusing. 

Option 1 has the advantage of not confusing existing users. The API doesn't 
change, and the documentation becomes slightly more informative. The 
disadvantage is that it leaves new users a bit confused about what the heck an 
"IPC" is, which means the documentation will have to carry the load. 



[jira] [Created] (ARROW-18210) Skip check in StreamWriter

2022-11-01 Thread Madhur (Jira)
Madhur created ARROW-18210:
--

 Summary: Skip check in StreamWriter
 Key: ARROW-18210
 URL: https://issues.apache.org/jira/browse/ARROW-18210
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 10.0.0
Reporter: Madhur


Currently StreamWriter is slower only because it checks the columns. If we 
allowed a customization option (maybe a constructor argument) to skip that 
check, StreamWriter could be more efficient.





[jira] [Updated] (ARROW-18209) [Java] Make ComplexCopier agnostic of specific implementation of MapWriter, i.e UnionMapWriter

2022-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18209:
---
Labels: pull-request-available  (was: )

> [Java] Make ComplexCopier agnostic of specific implementation of MapWriter, 
> i.e UnionMapWriter
> --
>
> Key: ARROW-18209
> URL: https://issues.apache.org/jira/browse/ARROW-18209
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Vivek Shankar
>Assignee: Vivek Shankar
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Making ComplexCopier independent of UnionMapWriter lets us use different 
> implementations, such as PromotableWriter, instead. This helps us copy a map 
> vector with a map value. Otherwise we get the following error:
> {code:java}
> ClassCastException: class org.apache.arrow.vector.complex.impl.PromotableWriter 
> cannot be cast to class org.apache.arrow.vector.complex.impl.UnionMapWriter{code}


