[jira] [Created] (ARROW-18219) [R] read_csv_arrow fails when a string contains a backslash-escaped quote mark followed by a comma
Danielle Navarro created ARROW-18219: Summary: [R] read_csv_arrow fails when a string contains a backslash-escaped quote mark followed by a comma Key: ARROW-18219 URL: https://issues.apache.org/jira/browse/ARROW-18219 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 10.0.0 Reporter: Danielle Navarro `read_csv_arrow()` incorrectly parses CSV files when a string value contains a comma that appears after a backslash-escaped quote mark. Originally noted by Thomas Klebel https://scicomm.xyz/@tklebel/109270436511066953 This is an example that throws the error: ``` r x <- tempfile() readr::write_lines( ' id,text 1,"some text on \\"BLAH\\" and X, and Y also" ', x) cat(system(paste('cat', x), intern = TRUE), sep = "\n") #> #> id,text #> 1,"some text on \"BLAH\" and X, and Y also" arrow::read_csv_arrow(x, escape_backslash = TRUE) #> Error: #> ! Invalid: CSV parse error: Expected 2 columns, got 3: 1,"some text on \"BLAH\" and X, and Y also" #> Backtrace: #> ▆ #> 1. └─arrow (local) ``(file = x, escape_backslash = TRUE, delim = ",") #> 2. └─base::tryCatch(...) at r/R/csv.R:217:2 #> 3. └─base (local) tryCatchList(expr, classes, parentenv, handlers) #> 4. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]]) #> 5. └─value[[3L]](cond) #> 6. └─arrow:::augment_io_error_msg(e, call, schema = schema) at r/R/csv.R:222:6 #> 7. 
└─rlang::abort(msg, call = call) at r/R/util.R:251:2 ``` Created on 2022-11-02 with [reprex v2.0.2](https://reprex.tidyverse.org) This version includes four lines that might be expected to error but do not: ``` r x <- tempfile() readr::write_lines( ' id,text 2,"some text on X and Y" 3,"some text on X, and Y" 4,"some text on \\"BLAH\\" 5,"some text on X and Y, and \\"BLAH\\" also" ', x) cat(system(paste('cat', x), intern = TRUE), sep = "\n") #> #> id,text #> 2,"some text on X and Y" #> 3,"some text on X, and Y" #> 4,"some text on \"BLAH\" #> 5,"some text on X and Y, and \"BLAH\" also" arrow::read_csv_arrow(x, escape_backslash = TRUE) #> # A tibble: 4 × 2 #> id text #> <int> <chr> #> 1 2 "some text on X and Y" #> 2 3 "some text on X, and Y" #> 3 4 "some text on \\BLAH\\\"" #> 4 5 "some text on X and Y, and \\BLAH\\\" also\"" ``` Created on 2022-11-02 with [reprex v2.0.2](https://reprex.tidyverse.org) I'm not sure if the problem is R-specific. I've partially reproduced the error using reticulate and pyarrow as follows, but notice that this errors at a different point: the pyarrow version appears to fail at the comma preceding the backslash-escaped quote mark: ``` r x <- tempfile() readr::write_lines( ' id,text 1,"some text on X and Y" 2,"some text on X, and Y" 3,"some text on \\"BLAH\\" 4,"some text on X and Y, and \\"BLAH\\" also" 5,"some text on \\"BLAH\\" and X, and Y also" ', x) cat(system(paste('cat', x), intern = TRUE), sep = "\n") #> #> id,text #> 1,"some text on X and Y" #> 2,"some text on X, and Y" #> 3,"some text on \"BLAH\" #> 4,"some text on X and Y, and \"BLAH\" also" #> 5,"some text on \"BLAH\" and X, and Y also" csv <- reticulate::import("pyarrow.csv") opt <- csv$ParseOptions(escape_char='\\') csv$read_csv(x, parse_options = opt) #> Error in py_call_impl(callable, dots$args, dots$keywords): pyarrow.lib.ArrowInvalid: CSV parse error: Expected 2 columns, got 3: 3,"some text on \"BLAH\" #> 4,"some text on X and Y, and \"BLAH\" also" ``` Created on 2022-11-02 with
[reprex v2.0.2](https://reprex.tidyverse.org) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18218) [R] read_fwf_arrow
Lucas Mation created ARROW-18218: Summary: [R] read_fwf_arrow Key: ARROW-18218 URL: https://issues.apache.org/jira/browse/ARROW-18218 Project: Apache Arrow Issue Type: New Feature Reporter: Lucas Mation It would be great if `arrow` provided a function to read fixed-width file (FWF) formats. I have asked for help with this in two Stack Overflow posts ([here|https://stackoverflow.com/questions/74280697/dplyr-way-to-break-variable-into-multiple-columns-acording-to-layout-dictionar/74281380?noredirect=1#comment131145927_74281380] and [here|https://stackoverflow.com/questions/74276222/r-arrow-how-to-read-fwf-format-in-r-using-arrow/74279929#74279929]), but have not had much success so far. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18217) Multiple Filesystem subclasses are missing an override for Equals
[ https://issues.apache.org/jira/browse/ARROW-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vyas Ramasubramani updated ARROW-18217: --- Description: Currently the `Filesystem` class contains two overloads for the `Equals` method: {{virtual bool Equals(const FileSystem& other) const = 0;}} {{virtual bool Equals(const std::shared_ptr<FileSystem>& other) const}} {{{ return Equals(*other); }}} The second is a trivial call to the first for ease of use. The first method is pure virtual and _must_ be overridden by subclasses. The problem is that overriding a single overload of a method also shadows all other overloads. As a result, it is no longer possible to call the `shared_ptr` version of the method. This appears to be the case for the `SubTreeFileSystem` and the `SlowFileSystem` in `filesystem.h` as well as the `S3FileSystem` in `s3fs.h`. There may be other classes with this problem as well; those are just the ones that I noticed. My guess is that what was intended here is to pull the method into the child class's namespace via a using declaration, i.e. add `using FileSystem::Equals` to each child class. was: Currently the `Filesystem` class contains two overloads for the `Equals` method: ``` virtual bool Equals(const FileSystem& other) const = 0; virtual bool Equals(const std::shared_ptr<FileSystem>& other) const { return Equals(*other); } ``` The second is a trivial call to the first for ease of use. The first method is pure virtual and _must_ be overridden by subclasses. The problem is that overriding a single overload of a method also shadows all other overloads. As a result, it is no longer possible to call the `shared_ptr` version of the method. This appears to be the case for the `SubTreeFileSystem` and the `SlowFileSystem` in `filesystem.h` as well as the `S3FileSystem` in `s3fs.h`. There may be other classes with this problem as well; those are just the ones that I noticed. 
My guess is that what was intended here is to pull the method into the child class's namespace via a using declaration, i.e. add `using FileSystem::Equals` to each child class. > Multiple Filesystem subclasses are missing an override for Equals > - > > Key: ARROW-18217 > URL: https://issues.apache.org/jira/browse/ARROW-18217 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 9.0.0, 10.0.0 >Reporter: Vyas Ramasubramani >Priority: Minor > > Currently the `Filesystem` class contains two overloads for the `Equals` > method: > {{virtual bool Equals(const FileSystem& other) const = 0;}} > {{virtual bool Equals(const std::shared_ptr<FileSystem>& other) const}} > {{{ return Equals(*other); }}} > The second is a trivial call to the first for ease of use. The first method > is pure virtual and _must_ be overridden by subclasses. The problem is that > overriding a single overload of a method also shadows all other overloads. As > a result, it is no longer possible to call the `shared_ptr` version of the > method. This appears to be the case for the `SubTreeFileSystem` and the > `SlowFileSystem` in `filesystem.h` as well as the `S3FileSystem` in `s3fs.h`. > There may be other classes with this problem as well; those are just the ones > that I noticed. My guess is that what was intended here is to pull the method > into the child class's namespace via a using declaration, i.e. add `using > FileSystem::Equals` to each child class. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18217) Multiple Filesystem subclasses are missing an override for Equals
[ https://issues.apache.org/jira/browse/ARROW-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vyas Ramasubramani updated ARROW-18217: --- Description: Currently the `Filesystem` class contains two overloads for the `Equals` method: {{virtual bool Equals(const FileSystem& other) const = 0;}} {{virtual bool Equals(const std::shared_ptr<FileSystem>& other) const}} { return Equals(*other); } The second is a trivial call to the first for ease of use. The first method is pure virtual and _must_ be overridden by subclasses. The problem is that overriding a single overload of a method also shadows all other overloads. As a result, it is no longer possible to call the `shared_ptr` version of the method. This appears to be the case for the `SubTreeFileSystem` and the `SlowFileSystem` in `filesystem.h` as well as the `S3FileSystem` in `s3fs.h`. There may be other classes with this problem as well; those are just the ones that I noticed. My guess is that what was intended here is to pull the method into the child class's namespace via a using declaration, i.e. add `using FileSystem::Equals` to each child class. was: Currently the `Filesystem` class contains two overloads for the `Equals` method: {{virtual bool Equals(const FileSystem& other) const = 0;}} {{virtual bool Equals(const std::shared_ptr<FileSystem>& other) const}} {{{ return Equals(*other); }}} The second is a trivial call to the first for ease of use. The first method is pure virtual and _must_ be overridden by subclasses. The problem is that overriding a single overload of a method also shadows all other overloads. As a result, it is no longer possible to call the `shared_ptr` version of the method. This appears to be the case for the `SubTreeFileSystem` and the `SlowFileSystem` in `filesystem.h` as well as the `S3FileSystem` in `s3fs.h`. There may be other classes with this problem as well; those are just the ones that I noticed. 
My guess is that what was intended here is to pull the method into the child class's namespace via a using declaration, i.e. add `using FileSystem::Equals` to each child class. > Multiple Filesystem subclasses are missing an override for Equals > - > > Key: ARROW-18217 > URL: https://issues.apache.org/jira/browse/ARROW-18217 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 9.0.0, 10.0.0 >Reporter: Vyas Ramasubramani >Priority: Minor > > Currently the `Filesystem` class contains two overloads for the `Equals` > method: > {{virtual bool Equals(const FileSystem& other) const = 0;}} > {{virtual bool Equals(const std::shared_ptr<FileSystem>& other) const}} > { return Equals(*other); } > The second is a trivial call to the first for ease of use. The first method > is pure virtual and _must_ be overridden by subclasses. The problem is that > overriding a single overload of a method also shadows all other overloads. As > a result, it is no longer possible to call the `shared_ptr` version of the > method. This appears to be the case for the `SubTreeFileSystem` and the > `SlowFileSystem` in `filesystem.h` as well as the `S3FileSystem` in `s3fs.h`. > There may be other classes with this problem as well; those are just the ones > that I noticed. My guess is that what was intended here is to pull the method > into the child class's namespace via a using declaration, i.e. add `using > FileSystem::Equals` to each child class. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18217) Multiple Filesystem subclasses are missing an override for Equals
Vyas Ramasubramani created ARROW-18217: -- Summary: Multiple Filesystem subclasses are missing an override for Equals Key: ARROW-18217 URL: https://issues.apache.org/jira/browse/ARROW-18217 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 10.0.0, 9.0.0 Reporter: Vyas Ramasubramani Currently the `Filesystem` class contains two overloads for the `Equals` method: ``` virtual bool Equals(const FileSystem& other) const = 0; virtual bool Equals(const std::shared_ptr<FileSystem>& other) const { return Equals(*other); } ``` The second is a trivial call to the first for ease of use. The first method is pure virtual and _must_ be overridden by subclasses. The problem is that overriding a single overload of a method also shadows all other overloads. As a result, it is no longer possible to call the `shared_ptr` version of the method. This appears to be the case for the `SubTreeFileSystem` and the `SlowFileSystem` in `filesystem.h` as well as the `S3FileSystem` in `s3fs.h`. There may be other classes with this problem as well; those are just the ones that I noticed. My guess is that what was intended here is to pull the method into the child class's namespace via a using declaration, i.e. add `using FileSystem::Equals` to each child class. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18207) Rubygems not updating in concert with majors
[ https://issues.apache.org/jira/browse/ARROW-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627379#comment-17627379 ] Kouhei Sutou commented on ARROW-18207: -- I think that we can't decide it without data. BTW, the problem can be fixed by improving the extpp gem. So I've improved and released a new extpp gem. > Rubygems not updating in concert with majors > > > Key: ARROW-18207 > URL: https://issues.apache.org/jira/browse/ARROW-18207 > Project: Apache Arrow > Issue Type: Bug > Components: Ruby >Affects Versions: 10.0.0 >Reporter: Noah Horton >Assignee: Kouhei Sutou >Priority: Major > > 10.0.0 just released, meaning that all install scripts that use the > 'latest' tag are getting it. > Yet rubygems.org is still running with the 9.0.0 version a week after 10.0.0 > released. > The build scripts need to start updating rubygems.org automatically, or guide > users to a bundler config like > {code:ruby} > gem "red-arrow", github: "apache/arrow", glob: "ruby/red-arrow/*.gemspec", > require: "arrow", tag: 'apache-arrow-10.0.0' > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18215) [R] User experience improvements
[ https://issues.apache.org/jira/browse/ARROW-18215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-18215: - Description: Umbrella ticket to collect together tickets relating to improving error messages, and general user experience tweaks (was: Umbrella ticket to collect together tickets relating to improving error messages, and general dev-experience tweaks) > [R] User experience improvements > > > Key: ARROW-18215 > URL: https://issues.apache.org/jira/browse/ARROW-18215 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Major > > Umbrella ticket to collect together tickets relating to improving error > messages, and general user experience tweaks -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-18199) [R] Misleading error message in query using across()
[ https://issues.apache.org/jira/browse/ARROW-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane reassigned ARROW-18199: Assignee: (was: Nicola Crane) > [R] Misleading error message in query using across() > > > Key: ARROW-18199 > URL: https://issues.apache.org/jira/browse/ARROW-18199 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Priority: Critical > > Error handling looks like it's happening in the wrong place - a comma has > been missed in the {{select()}} but it's wrongly appearing like it's an issue > with {{across()}}. Can we do something to make this not happen? > {code:r} > download.file( > url = > "https://github.com/djnavarro/arrow-user2022/releases/download/v0.1/nyc-taxi-tiny.zip", > destfile = here::here("data/nyc-taxi-tiny.zip") > ) > library(arrow) > library(dplyr) > open_dataset("data") %>% > select(pickup_datetime, pickup_longitude, pickup_latitude > ends_with("amount")) %>% > mutate(across(ends_with("amount"), ~.x * 0.87, .names = "{.col}_gbp")) %>% > collect() > {code} > {code:r} > Error in `across()`: > ! Must be used inside dplyr verbs. > Run `rlang::last_error()` to see where the error occurred. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-18200) [R] Misleading error message if opening CSV dataset with invalid file in directory
[ https://issues.apache.org/jira/browse/ARROW-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane reassigned ARROW-18200: Assignee: (was: Nicola Crane) > [R] Misleading error message if opening CSV dataset with invalid file in > directory > -- > > Key: ARROW-18200 > URL: https://issues.apache.org/jira/browse/ARROW-18200 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Priority: Major > > I made a mistake before where I thought a dataset contained CSVs which were, > in fact, Parquet files, but the error message I got was super unhelpful > {code:r} > library(arrow) > download.file( > url = > "https://github.com/djnavarro/arrow-user2022/releases/download/v0.1/nyc-taxi-tiny.zip", > destfile = here::here("data/nyc-taxi-tiny.zip") > ) > # (unzip the zip file into the data directory but don't delete it after) > open_dataset("data", format = "csv") > {code} > {code:r} > Error in nchar(x) : invalid multibyte string, element 1 > In addition: Warning message: > In grepl("No match for FieldRef.Name(__filename)", msg, fixed = TRUE) : > input string 1 is invalid in this locale > {code} > Note, this only occurs with {{format="csv"}} and omitting this argument (i.e. > the default of {{format="parquet"}}) leaves us with the much better error: > {code:r} > Error in `open_dataset()`: > ! Invalid: Error creating dataset. Could not read schema from > '/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Could not open Parquet > input source '/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Parquet > magic bytes not found in footer. Either the file is corrupted or this is not > a parquet file. > /home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:338 GetReader(source, > scan_options). Is this a 'parquet' file? 
> /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:44 > InspectSchemas(std::move(options)) > /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:265 > Inspect(options.inspect_options) > ℹ Did you mean to specify a 'format' other than the default (parquet)? > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17607) [R] Add as_scalar()
[ https://issues.apache.org/jira/browse/ARROW-17607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-17607: - Parent: ARROW-18215 Issue Type: Sub-task (was: New Feature) > [R] Add as_scalar() > --- > > Key: ARROW-17607 > URL: https://issues.apache.org/jira/browse/ARROW-17607 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Neal Richardson >Priority: Major > > There's as_everything_else but not as_scalar. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18216) [R] Better error message when creating an array from decimals
Nicola Crane created ARROW-18216: Summary: [R] Better error message when creating an array from decimals Key: ARROW-18216 URL: https://issues.apache.org/jira/browse/ARROW-18216 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane We should first check why this doesn't work, and whether we can fix the underlying problem rather than just the error message {code:r} > ChunkedArray$create(c(1.4, 525.5), type = decimal(precision = 1, scale = 3)) Error: NotImplemented: Extend {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18200) [R] Misleading error message if opening CSV dataset with invalid file in directory
[ https://issues.apache.org/jira/browse/ARROW-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-18200: - Parent: ARROW-18215 Issue Type: Sub-task (was: Bug) > [R] Misleading error message if opening CSV dataset with invalid file in > directory > -- > > Key: ARROW-18200 > URL: https://issues.apache.org/jira/browse/ARROW-18200 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Major > > I made a mistake before where I thought a dataset contained CSVs which were, > in fact, Parquet files, but the error message I got was super unhelpful > {code:r} > library(arrow) > download.file( > url = > "https://github.com/djnavarro/arrow-user2022/releases/download/v0.1/nyc-taxi-tiny.zip", > destfile = here::here("data/nyc-taxi-tiny.zip") > ) > # (unzip the zip file into the data directory but don't delete it after) > open_dataset("data", format = "csv") > {code} > {code:r} > Error in nchar(x) : invalid multibyte string, element 1 > In addition: Warning message: > In grepl("No match for FieldRef.Name(__filename)", msg, fixed = TRUE) : > input string 1 is invalid in this locale > {code} > Note, this only occurs with {{format="csv"}} and omitting this argument (i.e. > the default of {{format="parquet"}}) leaves us with the much better error: > {code:r} > Error in `open_dataset()`: > ! Invalid: Error creating dataset. Could not read schema from > '/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Could not open Parquet > input source '/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Parquet > magic bytes not found in footer. Either the file is corrupted or this is not > a parquet file. > /home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:338 GetReader(source, > scan_options). Is this a 'parquet' file? 
> /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:44 > InspectSchemas(std::move(options)) > /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:265 > Inspect(options.inspect_options) > ℹ Did you mean to specify a 'format' other than the default (parquet)? > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18176) [R] arrow::open_dataset %>% select(myvars) %>% collect causes memory leak
[ https://issues.apache.org/jira/browse/ARROW-18176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lucas Mation updated ARROW-18176: - Description: I first posted on Stack Overflow, [here.|https://stackoverflow.com/questions/74221492/r-arrow-open-dataset-selectmyvars-collect-causing-memory-leak] I am having trouble using arrow in R. First, I saved some {{data.tables}} that were about 50-60Gb ({{{}d{}}} in the code chunk) in memory to a parquet file using: {{d %>% write_dataset(f, format='parquet') # f is the directory name}} Then I try to open the file, select the relevant variables, and collect: {{tic(); d2 <- open_dataset(f) %>% select(all_of(myvars)) %>% collect # myvars is a vector of variable names; toc()}} I did this conversion for 3 sets of data.tables (unfortunately, data is confidential so I can't include it in the example). In one set, I was able to {{open>select>collect}} the desired table in about 60s, obtaining a 10Gb file (after variable selection). For the other two sets, the command caused a memory leak. tic()-toc() returned after 80s. But the object name (d2) never appeared in Rstudio's "Environment panel", and memory use kept creeping up until it occupied most of the available RAM on the server, and then R crashed. Note the original dataset, without subsetting cols, was smaller than 60Gb and the server had 512GB. Any ideas on what could be going on here? UPDATE: today I noticed a few more things. 1) If the collected object is small enough (3 cols, 66 million rows), R will unfreeze. The console becomes responsive, the object shows up in the Environment panel. But memory use keeps going up (by small amounts, because the underlying data is small). While this is happening, issuing a gc() command reduces the memory use, but it then starts growing again. 2) Even after "rm(d2)" and "gc()", the R session that issued the arrow commands still uses around 60-70Gb of RAM... The only way to end that is to close the R session. 
3) I am using arrow 10.0.0 was: I first posted on Stack Overflow, [here.|https://stackoverflow.com/questions/74221492/r-arrow-open-dataset-selectmyvars-collect-causing-memory-leak] I am having trouble using arrow in R. First, I saved some {{data.tables}} that were about 50-60Gb ({{{}d{}}} in the code chunk) in memory to a parquet file using: {{d %>% write_dataset(f, format='parquet') # f is the directory name}} Then I try to open the file, select the relevant variables, and collect: {{tic(); d2 <- open_dataset(f) %>% select(all_of(myvars)) %>% collect # myvars is a vector of variable names; toc()}} I did this conversion for 3 sets of data.tables (unfortunately, data is confidential so I can't include it in the example). In one set, I was able to {{open>select>collect}} the desired table in about 60s, obtaining a 10Gb file (after variable selection). For the other two sets, the command caused a memory leak. tic()-toc() returned after 80s. But the object name (d2) never appeared in Rstudio's "Environment panel", and memory use kept creeping up until it occupied most of the available RAM on the server, and then R crashed. Note the original dataset, without subsetting cols, was smaller than 60Gb and the server had 512GB. Any ideas on what could be going on here? UPDATE: today I noticed a few more things. 1) If the collected object is small enough (3 cols, 66 million rows), R will unfreeze. The console becomes responsive, the object shows up in the Environment panel. But memory use keeps going up (by small amounts, because the underlying data is small). While this is happening, issuing a gc() command reduces the memory use, but it then starts growing again. 2) Even after "rm(d2)" and "gc()", the R session that issued the arrow commands still uses around 60-70Gb of RAM... The only way to end that is to close the R session. 
> [R] arrow::open_dataset %>% select(myvars) %>% collect causes memory leak > - > > Key: ARROW-18176 > URL: https://issues.apache.org/jira/browse/ARROW-18176 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lucas Mation >Priority: Critical > > I first posted on Stack Overflow, > [here.|https://stackoverflow.com/questions/74221492/r-arrow-open-dataset-selectmyvars-collect-causing-memory-leak] > I am having trouble using arrow in R. First, I saved some {{data.tables}} > that were about 50-60Gb ({{{}d{}}} in the code chunk) in memory to a parquet > file using: > > {{d %>% write_dataset(f, format='parquet') # f is the directory name}} > Then I try to open the file, select the relevant variables, and collect: > > {{tic(); d2 <- open_dataset(f) %>% select(all_of(myvars)) %>% collect # myvars > is a vector of variable names; toc()}} > I did this conversion for 3 sets of data.tables (unfortunately, data is >
[jira] [Created] (ARROW-18215) [R] User experience improvements
Nicola Crane created ARROW-18215: Summary: [R] User experience improvements Key: ARROW-18215 URL: https://issues.apache.org/jira/browse/ARROW-18215 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Umbrella ticket to collect together tickets relating to improving error messages, and general dev-experience tweaks -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18199) [R] Misleading error message in query using across()
[ https://issues.apache.org/jira/browse/ARROW-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane updated ARROW-18199: - Parent: ARROW-18215 Issue Type: Sub-task (was: Bug) > [R] Misleading error message in query using across() > > > Key: ARROW-18199 > URL: https://issues.apache.org/jira/browse/ARROW-18199 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Critical > > Error handling looks like it's happening in the wrong place - a comma has > been missed in the {{select()}} but it's wrongly appearing like it's an issue > with {{across()}}. Can we do something to make this not happen? > {code:r} > download.file( > url = > "https://github.com/djnavarro/arrow-user2022/releases/download/v0.1/nyc-taxi-tiny.zip", > destfile = here::here("data/nyc-taxi-tiny.zip") > ) > library(arrow) > library(dplyr) > open_dataset("data") %>% > select(pickup_datetime, pickup_longitude, pickup_latitude > ends_with("amount")) %>% > mutate(across(ends_with("amount"), ~.x * 0.87, .names = "{.col}_gbp")) %>% > collect() > {code} > {code:r} > Error in `across()`: > ! Must be used inside dplyr verbs. > Run `rlang::last_error()` to see where the error occurred. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-18051) [C++] Enable tests skipped by ARROW-16392
[ https://issues.apache.org/jira/browse/ARROW-18051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vibhatha Lakmal Abeykoon reassigned ARROW-18051: Assignee: Vibhatha Lakmal Abeykoon > [C++] Enable tests skipped by ARROW-16392 > - > > Key: ARROW-18051 > URL: https://issues.apache.org/jira/browse/ARROW-18051 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > There are a number of unit tests that we still skip (on Windows) due to > ARROW-16392. However, ARROW-16392 has been fixed. There is no reason to > skip these any longer. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-18183) [C++] cpp-micro benchmarks are failing on mac arm machine
[ https://issues.apache.org/jira/browse/ARROW-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace reassigned ARROW-18183: --- Assignee: Weston Pace > [C++] cpp-micro benchmarks are failing on mac arm machine > - > > Key: ARROW-18183 > URL: https://issues.apache.org/jira/browse/ARROW-18183 > Project: Apache Arrow > Issue Type: Bug > Components: Benchmarking, C++ >Reporter: Elena Henderson >Assignee: Weston Pace >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18183) [C++]cpp-micro benchmarks are failing on mac arm machine
[ https://issues.apache.org/jira/browse/ARROW-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-18183: --- Labels: pull-request-available (was: ) > [C++]cpp-micro benchmarks are failing on mac arm machine > > > Key: ARROW-18183 > URL: https://issues.apache.org/jira/browse/ARROW-18183 > Project: Apache Arrow > Issue Type: Bug > Components: Benchmarking, C++ >Reporter: Elena Henderson >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18183) [C++] cpp-micro benchmarks are failing on mac arm machine
[ https://issues.apache.org/jira/browse/ARROW-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-18183: Summary: [C++] cpp-micro benchmarks are failing on mac arm machine (was: [C++]cpp-micro benchmarks are failing on mac arm machine) > [C++] cpp-micro benchmarks are failing on mac arm machine > - > > Key: ARROW-18183 > URL: https://issues.apache.org/jira/browse/ARROW-18183 > Project: Apache Arrow > Issue Type: Bug > Components: Benchmarking, C++ >Reporter: Elena Henderson >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18183) [C++]cpp-micro benchmarks are failing on mac arm machine
[ https://issues.apache.org/jira/browse/ARROW-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-18183: Summary: [C++]cpp-micro benchmarks are failing on mac arm machine (was: cpp-micro benchmarks are failing on mac arm machine) > [C++]cpp-micro benchmarks are failing on mac arm machine > > > Key: ARROW-18183 > URL: https://issues.apache.org/jira/browse/ARROW-18183 > Project: Apache Arrow > Issue Type: Bug > Components: Benchmarking, C++ >Reporter: Elena Henderson >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18148) [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather
[ https://issues.apache.org/jira/browse/ARROW-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627350#comment-17627350 ] Danielle Navarro commented on ARROW-18148: -- Agreed. For the current PR I'll write it as though the API weren't changing, but will still preference the term "Arrow" over "Feather" where that's relevant > [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather > -- > > Key: ARROW-18148 > URL: https://issues.apache.org/jira/browse/ARROW-18148 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, R >Reporter: Stephanie Hazlitt >Priority: Minor > Labels: feather > > Following up from [this mailing list > conversation|https://lists.apache.org/thread/nxncph842h8tyovxp04hrzq4y35lq4xq], > I am wondering if the R package should rename `read_ipc_file()` / > write_ipc_file()` to `read_arrow_file()`/ `write_arrow_file()`, or add an > additional alias for both. It might also be helpful to update the > documentation so that users read "Write an Arrow file (formerly known as a > Feather file)" rather than the current Feather-named first approach, assuming > there is a community decision to coalesce around the name Arrow for the file > format, and the project is moving on from the name Feather. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18114) [R] unify_schemas=FALSE does not improve open_dataset() read times
[ https://issues.apache.org/jira/browse/ARROW-18114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627339#comment-17627339 ] Carl Boettiger commented on ARROW-18114: Thanks Weston! Any update here? > [R] unify_schemas=FALSE does not improve open_dataset() read times > -- > > Key: ARROW-18114 > URL: https://issues.apache.org/jira/browse/ARROW-18114 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Carl Boettiger >Priority: Major > > open_dataset() provides the very helpful optional argument to set > unify_schemas=FALSE, which should allow arrow to inspect a single parquet > file instead of touching potentially thousands or more parquet files to > determine a consistent unified schema. This ought to provide a substantial > performance increase in contexts where the schema is known in advance. > Unfortunately, in my tests it seems to have no impact on performance. > Consider the following reprexes: > default, unify_schemas=TRUE > {code:java} > library(arrow) > ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min", > endpoint_override = "data.ecoforecast.org", anonymous=TRUE) > bench::bench_time( > { open_dataset(ex) } > ){code} > about 32 seconds for me. > manual, unify_schemas=FALSE: > {code:java} > bench::bench_time({ > open_dataset(ex, unify_schemas = FALSE) > }){code} > takes about 32 seconds as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18214) [R] Use ISO 8601 in character representations of datetimes?
[ https://issues.apache.org/jira/browse/ARROW-18214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carl Boettiger updated ARROW-18214: --- Description: Arrow needs to represent datetime / timestamp values as character strings, e.g. when writing to CSV or when generating partitions on timestamp-valued column. When this occurs, Arrow generates a string such as: "2022-11-01 21:12:46.771925+" In particular, this uses a space instead of a T between the date and time components. I believe either is permitted in [RFC 3339|https://www.rfc-editor.org/rfc/rfc3339.html#section-5] ??5.6. NOTE: ISO 8601 defines date and time separated by "T". Applications using this syntax may choose, for the sake of readability, to specify a full-date and full-time separated by (say) a space character.?? But as RFC 3339 notes, this is not valid under ISO 8601. It would be preferable to stick to the stricter ISO 8601 convention. was: Arrow needs to represent datetime / timestamp values as character strings, e.g. when writing to CSV or when generating partitions on timestamp-valued column. When this occurs, Arrow generates a string such as: "2022-11-01 21:12:46.771925+" In particular, this uses a space instead of a T between the date and time components. I believe either is permitted in [RFC 3339|https://www.rfc-editor.org/rfc/rfc3339.html#section-5] ??5.6. NOTE: ISO 8601 defines date and time separated by "T". Applications using this syntax may choose, for the sake of readability, to specify a full-date and full-time separated by (say) a space character.?? But as RFC 3339 notes, this is not valid under ISO 8601. It would be preferable to stick to the stricter ISO 8601 convention. This would be more consistent with other software. > [R] Use ISO 8601 in character representations of datetimes? 
> --- > > Key: ARROW-18214 > URL: https://issues.apache.org/jira/browse/ARROW-18214 > Project: Apache Arrow > Issue Type: Bug >Reporter: Carl Boettiger >Priority: Major > > Arrow needs to represent datetime / timestamp values as character strings, > e.g. when writing to CSV or when generating partitions on timestamp-valued > column. When this occurs, Arrow generates a string such as: > "2022-11-01 21:12:46.771925+" > In particular, this uses a space instead of a T between the date and time > components. I believe either is permitted in [RFC > 3339|https://www.rfc-editor.org/rfc/rfc3339.html#section-5] > ??5.6. NOTE: ISO 8601 defines date and time separated by "T". Applications > using this syntax may choose, for the sake of readability, to specify a > full-date and full-time separated by (say) a space character.?? > > But as RFC 3339 notes, this is not valid under ISO 8601. It would be > preferable to stick to the stricter ISO 8601 convention. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18214) [R] Use ISO 8601 in character representations of datetimes?
Carl Boettiger created ARROW-18214: -- Summary: [R] Use ISO 8601 in character representations of datetimes? Key: ARROW-18214 URL: https://issues.apache.org/jira/browse/ARROW-18214 Project: Apache Arrow Issue Type: Bug Reporter: Carl Boettiger Arrow needs to represent datetime / timestamp values as character strings, e.g. when writing to CSV or when generating partitions on timestamp-valued column. When this occurs, Arrow generates a string such as: "2022-11-01 21:12:46.771925+" In particular, this uses a space instead of a T between the date and time components. I believe either is permitted in [RFC 3339|https://www.rfc-editor.org/rfc/rfc3339.html#section-5] ??5.6. NOTE: ISO 8601 defines date and time separated by "T". Applications using this syntax may choose, for the sake of readability, to specify a full-date and full-time separated by (say) a space character.?? But as RFC 3339 notes, this is not valid under ISO 8601. It would be preferable to stick to the stricter ISO 8601 convention. This would be more consistent with other software. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17640) [C++] Add File Handling Test cases for GlobFile handling in Substrait Read
[ https://issues.apache.org/jira/browse/ARROW-17640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace resolved ARROW-17640. - Fix Version/s: 11.0.0 Resolution: Fixed Issue resolved by pull request 14132 [https://github.com/apache/arrow/pull/14132] > [C++] Add File Handling Test cases for GlobFile handling in Substrait Read > -- > > Key: ARROW-17640 > URL: https://issues.apache.org/jira/browse/ARROW-17640 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > At the moment the `GlobFiles` function hasn't been tested with an end-to-end > Substrait-To-Arrow case. Also observed that leading slash is ignored in this > API. Proposed changes are adding test cases to cover the fix including > end-to-end test cases. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
[ https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627330#comment-17627330 ] Kouhei Sutou commented on ARROW-17374: -- Could you provide command lines to reproduce this with conda? > [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND > -- > > Key: ARROW-17374 > URL: https://issues.apache.org/jira/browse/ARROW-17374 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 8.0.0, 8.0.1, 9.0.0 > Environment: Amazon Linux 2 (RHEL) - 5.10.102-99.473.amzn2.x86_64 >Reporter: Shane Brennan >Priority: Blocker > Attachments: build-images.out > > > I've been trying to install Arrow on an R notebook within AWS SageMaker. > SageMaker provides Jupyter-like notebooks, with each instance running Amazon > Linux 2 as its OS, itself based on RHEL. > Trying to install a few ways, e.g., using the standard binaries, using the > nightly builds, setting ARROW_WITH_SNAPPY to ON and LIBARROW_MINIMAL all > still result in the following error. 
> {noformat} > x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared > -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -Wl,-O2 -Wl,--sort-common > -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags > -Wl,--gc-sections -Wl,--allow-shlib-undefined > -Wl,-rpath,/home/ec2-user/anaconda3/envs/R/lib > -Wl,-rpath-link,/home/ec2-user/anaconda3/envs/R/lib > -L/home/ec2-user/anaconda3/envs/R/lib -o arrow.so RTasks.o altrep.o array.o > array_to_vector.o arraydata.o arrowExports.o bridge.o buffer.o chunkedarray.o > compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o > expression.o extension-impl.o feather.o field.o filesystem.o imports.o io.o > json.o memorypool.o message.o parquet.o r_to_arrow.o recordbatch.o > recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o > schema.o symbols.o table.o threadpool.o type_infer.o > -L/tmp/Rtmpuh87oc/R.INSTALL67114493a3de/arrow/libarrow/arrow-9.0.0.20220809/lib > -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies -lz > SNAPPY_LIB-NOTFOUND /home/ec2-user/anaconda3/envs/R/lib/libbz2.so -pthread > -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto > -lcurl -lssl -lcrypto -lcurl -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -lR > x86_64-conda-linux-gnu-c++: error: SNAPPY_LIB-NOTFOUND: No such file or > directory > make: *** [/home/ec2-user/anaconda3/envs/R/lib/R/share/make/shlib.mk:10: > arrow.so] Error 1{noformat} > Snappy is installed on the systems, and both shared object (.so) and cmake > files are there, where I've tried setting the system env variables Snappy_DIR > and Snappy_LIB to point at them, but to no avail. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18161) [Ruby] Tables can have buffers get GC'ed
[ https://issues.apache.org/jira/browse/ARROW-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-18161: - Summary: [Ruby] Tables can have buffers get GC'ed (was: Ruby Arrow Tables can have buffers get GC'ed) > [Ruby] Tables can have buffers get GC'ed > > > Key: ARROW-18161 > URL: https://issues.apache.org/jira/browse/ARROW-18161 > Project: Apache Arrow > Issue Type: Bug > Components: Ruby >Affects Versions: 9.0.0 > Environment: Ruby 3.1.2 >Reporter: Noah Horton >Assignee: Kouhei Sutou >Priority: Major > > Given an Arrow::Table with several columns "X" > > {code:ruby} > # Rails console outputs > 3.1.2 :107 > x.schema > => > # dates: date32[day] > expected_values: double> > 3.1.2 :108 > x.schema > => > # dates: date32[day] > expected_values: double> > 3.1.2 :109 > {code} > Note that the object and pointer have both changed values. > But the far bigger issue is that repeated reads from it will cause different > results: > {code:ruby} > 3.1.2 :097 > x[1][0] > => Sun, 22 Aug 2021 > 3.1.2 :098 > x[1][1] > => nil > 3.1.2 :099 > x[1][0] > => nil {code} > I have a lot of issues like this - when I have done these types of read > operations, I get the original table with the data in the columns all > shuffled around or deleted. > I do ingest the data slightly oddly in the first place as it comes in over > GRPC and I am using Arrow::Buffer to read it from the GRPC and then passing > that into Arrow::Table.load. But I would not expect that once it was in > Arrow::Table that I could do anything to permute it unintentionally. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-18186) [C++][MinGW] Fail to build with clang
[ https://issues.apache.org/jira/browse/ARROW-18186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-18186. -- Fix Version/s: 11.0.0 Resolution: Fixed Issue resolved by pull request 14536 [https://github.com/apache/arrow/pull/14536] > [C++][MinGW] Fail to build with clang > - > > Key: ARROW-18186 > URL: https://issues.apache.org/jira/browse/ARROW-18186 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > https://github.com/kou/arrow/actions/runs/3342340048/jobs/5534465173#step:7:768 > {noformat} > FAILED: src/arrow/CMakeFiles/arrow_shared.dir/util/int_util.cc.obj > D:\a\_temp\msys64\clang64\bin\ccache.exe > D:\a\_temp\msys64\clang64\bin\c++.exe -DARROW_EXPORTING > -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_BMI2 > -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 -DARROW_WITH_BROTLI > -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_RE2 -DARROW_WITH_SNAPPY > -DARROW_WITH_UTF8PROC -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD > -DAWS_AUTH_USE_IMPORT_EXPORT -DAWS_CAL_USE_IMPORT_EXPORT > -DAWS_CHECKSUMS_USE_IMPORT_EXPORT -DAWS_COMMON_USE_IMPORT_EXPORT > -DAWS_COMPRESSION_USE_IMPORT_EXPORT -DAWS_CRT_CPP_USE_IMPORT_EXPORT > -DAWS_EVENT_STREAM_USE_IMPORT_EXPORT -DAWS_HTTP_USE_IMPORT_EXPORT > -DAWS_IO_USE_IMPORT_EXPORT -DAWS_MQTT_USE_IMPORT_EXPORT > -DAWS_MQTT_WITH_WEBSOCKETS -DAWS_S3_USE_IMPORT_EXPORT > -DAWS_SDKUTILS_USE_IMPORT_EXPORT -DAWS_SDK_VERSION_MAJOR=1 > -DAWS_SDK_VERSION_MINOR=9 -DAWS_SDK_VERSION_PATCH=367 > -DAWS_USE_IO_COMPLETION_PORTS -DBOOST_ALL_DYN_LINK -DBOOST_ALL_NO_LIB > -DURI_STATIC_BUILD -DUSE_IMPORT_EXPORT -DUSE_IMPORT_EXPORT=1 > -DUSE_WINDOWS_DLL_SEMANTICS -D_CRT_SECURE_NO_WARNINGS > -D_ENABLE_EXTENDED_ALIGNED_STORAGE -Darrow_shared_EXPORTS > -ID:/a/arrow/arrow/build/cpp/src -ID:/a/arrow/arrow/cpp/src > 
-ID:/a/arrow/arrow/cpp/src/generated -isystem > D:/a/arrow/arrow/cpp/thirdparty/flatbuffers/include -isystem > D:/a/arrow/arrow/cpp/thirdparty/hadoop/include -isystem > D:/a/arrow/arrow/build/cpp/google_cloud_cpp_ep-install/include -isystem > D:/a/arrow/arrow/build/cpp/crc32c_ep-install/include -Qunused-arguments > -fcolor-diagnostics -O2 -DNDEBUG -Wa,-mbig-obj -Wall -Wextra -Wdocumentation > -Wshorten-64-to-32 -Wno-missing-braces -Wno-unused-parameter > -Wno-constant-logical-operand -Wno-return-stack-address > -Wno-unknown-warning-option -Wno-pass-failed -mxsave -msse4.2 -DNDEBUG > -pthread -std=c++17 -MD -MT > src/arrow/CMakeFiles/arrow_shared.dir/util/int_util.cc.obj -MF > src\arrow\CMakeFiles\arrow_shared.dir\util\int_util.cc.obj.d -o > src/arrow/CMakeFiles/arrow_shared.dir/util/int_util.cc.obj -c > D:/a/arrow/arrow/cpp/src/arrow/util/int_util.cc > D:/a/arrow/arrow/cpp/src/arrow/util/int_util.cc:463:1: error: an attribute > list cannot appear here > INSTANTIATE_ALL() > ^ > D:/a/arrow/arrow/cpp/src/arrow/util/int_util.cc:454:3: note: expanded from > macro 'INSTANTIATE_ALL' > INSTANTIATE_ALL_DEST(uint8_t) \ > ^ > D:/a/arrow/arrow/cpp/src/arrow/util/int_util.cc:444:3: note: expanded from > macro 'INSTANTIATE_ALL_DEST' > INSTANTIATE(uint8_t, DEST) \ > ^~ > D:/a/arrow/arrow/cpp/src/arrow/util/int_util.cc:440:12: note: expanded from > macro 'INSTANTIATE' > template ARROW_TEMPLATE_EXPORT void TransposeInts( \ >^ > D:/a/arrow/arrow/cpp/src/arrow/util/visibility.h:47:31: note: expanded from > macro 'ARROW_TEMPLATE_EXPORT' > #define ARROW_TEMPLATE_EXPORT ARROW_DLLEXPORT > ^~~ > D:/a/arrow/arrow/cpp/src/arrow/util/visibility.h:32:25: note: expanded from > macro 'ARROW_DLLEXPORT' > #define ARROW_DLLEXPORT [[gnu::dllexport]] > ^~ > ... 
> [127/801] Building CXX object > src/arrow/CMakeFiles/arrow_shared.dir/util/io_util.cc.obj > D:/a/arrow/arrow/cpp/src/arrow/util/io_util.cc:1079:7: warning: variable > 'oflag' set but not used [-Wunused-but-set-variable] > int oflag = _O_CREAT | _O_BINARY | _O_NOINHERIT; > ^ > D:/a/arrow/arrow/cpp/src/arrow/util/io_util.cc:1545:29: warning: missing > field 'InternalHigh' initializer [-Wmissing-field-initializers] > OVERLAPPED overlapped = {0}; > ^ > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17288) [C++] Create fragment scanners for csv/parquet/orc/ipc
[ https://issues.apache.org/jira/browse/ARROW-17288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-17288: - Assignee: (was: Weston Pace) > [C++] Create fragment scanners for csv/parquet/orc/ipc > -- > > Key: ARROW-17288 > URL: https://issues.apache.org/jira/browse/ARROW-17288 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Weston Pace >Priority: Major > > Once we have the basic scan node ready (with an initial implementation based > on in-memory fragments) then we can add over the file-format versions. We > may also want to consider adding JSON support for datasets at this time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18213) [R] Arrow 10 silently dropping missing values/blanks
Lorenzo Isella created ARROW-18213: -- Summary: [R] Arrow 10 silently dropping missing values/blanks Key: ARROW-18213 URL: https://issues.apache.org/jira/browse/ARROW-18213 Project: Apache Arrow Issue Type: Bug Reporter: Lorenzo Isella In the example below a single column text file is written to disk. It contains some blanks and when it is opened and collected, the blank values are silently dropped. I did not test this behavior on arrow 9.0. {code:java} library(tidyverse) library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp ll <- c( "100", "1000", "200" , "3000" , "50" , "500", "" , "Not Range") df <- tibble(x=rep(ll, 1000)) df #> # A tibble: 8,000 × 1 #>x #> #> 1 "100" #> 2 "1000" #> 3 "200" #> 4 "3000" #> 5 "50" #> 6 "500" #> 7 "" #> 8 "Not Range" #> 9 "100" #> 10 "1000" #> # … with 7,990 more rows df |> dim() #> [1] 8000    1 write_tsv(df, "data.tsv") data <- open_dataset("data.tsv", format="tsv", skip_rows=1, schema=schema(x=string())) test <- data |> collect() test #> # A tibble: 7,000 × 1 #>x #> #> 1 100 #> 2 1000 #> 3 200 #> 4 3000 #> 5 50 #> 6 500 #> 7 Not Range #> 8 100 #> 9 1000 #> 10 200 #> # … with 6,990 more rows test |> dim() ## the missing values/blanks have been dropped silently #> [1] 7000    1 sessionInfo() #> R version 4.2.2 (2022-10-31) #> Platform: x86_64-pc-linux-gnu (64-bit) #> Running under: Debian GNU/Linux 11 (bullseye) #> #> Matrix products: default #> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 #> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0 #> #> locale: #> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C #> [3] LC_TIME=en_GB.UTF-8LC_COLLATE=en_GB.UTF-8 #> [5] LC_MONETARY=en_GB.UTF-8LC_MESSAGES=en_GB.UTF-8 #> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C #> [9] LC_ADDRESS=C LC_TELEPHONE=C #> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C #> #> attached base packages: #> [1] stats graphics grDevices utils datasets methods base #> #> other attached packages: #> [1]
arrow_10.0.0forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10 #> [5] purrr_0.3.5 readr_2.1.3 tidyr_1.2.1 tibble_3.1.8 #> [9] ggplot2_3.3.6 tidyverse_1.3.2 #> #> loaded via a namespace (and not attached): #> [1] lubridate_1.8.0 assertthat_0.2.1digest_0.6.30 #> [4] utf8_1.2.2 R6_2.5.1cellranger_1.1.0 #> [7] backports_1.4.1 reprex_2.0.2evaluate_0.17 #> [10] httr_1.4.4 highr_0.9 pillar_1.8.1 #> [13] rlang_1.0.6 googlesheets4_1.0.1 readxl_1.4.1 #> [16] R.utils_2.12.1 R.oo_1.25.0 rmarkdown_2.17 #> [19] styler_1.8.0googledrive_2.0.0 bit_4.0.4 #> [22] munsell_0.5.0 broom_1.0.1 compiler_4.2.2 #> [25] modelr_0.1.9xfun_0.34 pkgconfig_2.0.3 #> [28] htmltools_0.5.3 tidyselect_1.2.0fansi_1.0.3 #> [31] crayon_1.5.2tzdb_0.3.0 dbplyr_2.2.1 #> [34] withr_2.5.0 R.methodsS3_1.8.2 grid_4.2.2 #> [37] jsonlite_1.8.3 gtable_0.3.1lifecycle_1.0.3 #> [40] DBI_1.1.3 magrittr_2.0.3 scales_1.2.1 #> [43] vroom_1.6.0 cli_3.4.1 stringi_1.7.8 #> [46] fs_1.5.2xml2_1.3.3 ellipsis_0.3.2 #> [49] generics_0.1.3 vctrs_0.5.0 tools_4.2.2 #> [52] bit64_4.0.5 R.cache_0.16.0 glue_1.6.2 #> [55] hms_1.1.2 parallel_4.2.2 fastmap_1.1.0 #> [58] yaml_2.3.6 colorspace_2.0-3gargle_1.2.1 #> [61] rvest_1.0.3 knitr_1.40 haven_2.5.1 Created on 2022-11-01 with reprex v2.0.2 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17288) [C++] Create fragment scanners for csv/parquet/orc/ipc
[ https://issues.apache.org/jira/browse/ARROW-17288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627282#comment-17627282 ] Apache Arrow JIRA Bot commented on ARROW-17288: --- This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon. > [C++] Create fragment scanners for csv/parquet/orc/ipc > -- > > Key: ARROW-17288 > URL: https://issues.apache.org/jira/browse/ARROW-17288 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > > Once we have the basic scan node ready (with an initial implementation based > on in-memory fragments) then we can add over the file-format versions. We > may also want to consider adding JSON support for datasets at this time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17867) [C++][FlightRPC] Expose bulk parameter binding in Flight SQL client
[ https://issues.apache.org/jira/browse/ARROW-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-17867. -- Fix Version/s: 11.0.0 Resolution: Fixed Issue resolved by pull request 14266 [https://github.com/apache/arrow/pull/14266] > [C++][FlightRPC] Expose bulk parameter binding in Flight SQL client > --- > > Key: ARROW-17867 > URL: https://issues.apache.org/jira/browse/ARROW-17867 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 3h > Remaining Estimate: 0h > > Also fix various issues noticed as part of ARROW-17661 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-18205) [C++] Substrait consumer is not converting right side references correctly on joins
[ https://issues.apache.org/jira/browse/ARROW-18205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace resolved ARROW-18205. - Fix Version/s: 11.0.0 Resolution: Fixed Issue resolved by pull request 14558 [https://github.com/apache/arrow/pull/14558] > [C++] Substrait consumer is not converting right side references correctly on > joins > --- > > Key: ARROW-18205 > URL: https://issues.apache.org/jira/browse/ARROW-18205 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > The Substrait plan expresses a join condition as a logical expression like: > {{field(0) == field(3)}} where {{0}} and {{3}} are indices into the > *combined* schema. These are then passed down to Acero which expects: > {{HashJoinNodeOptions(std::vector<FieldRef> in_left_keys, > std::vector<FieldRef> in_right_keys)}} > However, {{in_left_keys}} are field references into the *left* schema and > {{in_right_keys}} are field references into the *right* schema. > In other words, given the above expression ({{field(0) == field(3)}} if the > schema were: > left: > key: int32 > y: int32 > z: int32 > right: > key: int32 > x: int32 > Then {{in_left_keys}} should be {{field(0)}} (works correctly today) and > {{in_right_keys}} should be {{field(0)}} (today we are sending in > {{field(3)}}). -- This message was sent by Atlassian Jira (v8.20.10#820010)
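The remapping the issue describes is simple: the combined schema is the left fields followed by the right fields, so a combined index on the right side translates by subtracting the left schema's width. A sketch with a hypothetical helper:

```python
# Hypothetical sketch of the remapping described in the issue: Substrait join
# keys index the combined schema (left fields followed by right fields),
# while Acero expects indices into each side's own schema.
def split_join_key(combined_index: int, num_left_fields: int) -> tuple[str, int]:
    if combined_index < num_left_fields:
        return ("left", combined_index)
    return ("right", combined_index - num_left_fields)

# left schema: key, y, z (3 fields); right schema: key, x
# field(0) == field(3) in the combined schema becomes:
print(split_join_key(0, 3))  # ('left', 0)
print(split_join_key(3, 3))  # ('right', 0)
```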
[jira] [Commented] (ARROW-18148) [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather
[ https://issues.apache.org/jira/browse/ARROW-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627236#comment-17627236 ] Nicola Crane commented on ARROW-18148: -- Thanks for taking a look at this in such close detail - interesting to see the detail there around "stream format" and "file format". I think I'm leaning towards option 3 seeming like the way to go. I'm not keen on pushing the term "IPC" on users who are otherwise unaware of it and don't need to be aware of it, and like the API in option 3. Perhaps for now, disregard all of this in that PR you have open and those further docs updates can be made in a follow-up PR once the work to update these functions is done? > [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather > -- > > Key: ARROW-18148 > URL: https://issues.apache.org/jira/browse/ARROW-18148 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, R >Reporter: Stephanie Hazlitt >Priority: Minor > Labels: feather > > Following up from [this mailing list > conversation|https://lists.apache.org/thread/nxncph842h8tyovxp04hrzq4y35lq4xq], > I am wondering if the R package should rename `read_ipc_file()` / > write_ipc_file()` to `read_arrow_file()`/ `write_arrow_file()`, or add an > additional alias for both. It might also be helpful to update the > documentation so that users read "Write an Arrow file (formerly known as a > Feather file)" rather than the current Feather-named first approach, assuming > there is a community decision to coalesce around the name Arrow for the file > format, and the project is moving on from the name Feather. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18148) [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather
[ https://issues.apache.org/jira/browse/ARROW-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627230#comment-17627230 ] Stephanie Hazlitt commented on ARROW-18148: --- {quote}> "where do we talk about the nuance" {quote} One approach when making a big design change is to have a short vignette that explains the change itself (e.g. [dbplyr 2.0 did this|https://dbplyr.tidyverse.org/articles/backend-2.html]). What is proposed is not a breaking change, however if the package moves to having `read_arrow(..., format = c("file", "stream", "auto")) it might be worth a 101-level page on the function naming history, given there was an early version of `read_arrow()` which was deprecated, the marketing that needs to be done re: feather vs arrow naming and so on. This could also be done in the proposed Arrow serialization vignette with pointers, as suggested. > [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather > -- > > Key: ARROW-18148 > URL: https://issues.apache.org/jira/browse/ARROW-18148 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, R >Reporter: Stephanie Hazlitt >Priority: Minor > Labels: feather > > Following up from [this mailing list > conversation|https://lists.apache.org/thread/nxncph842h8tyovxp04hrzq4y35lq4xq], > I am wondering if the R package should rename `read_ipc_file()` / > write_ipc_file()` to `read_arrow_file()`/ `write_arrow_file()`, or add an > additional alias for both. It might also be helpful to update the > documentation so that users read "Write an Arrow file (formerly known as a > Feather file)" rather than the current Feather-named first approach, assuming > there is a community decision to coalesce around the name Arrow for the file > format, and the project is moving on from the name Feather. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16471) RecordBuilder UnmarshalJSON does not handle extra unknown fields with complex values
[ https://issues.apache.org/jira/browse/ARROW-16471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol reassigned ARROW-16471: - Assignee: Matthew Topol > RecordBuilder UnmarshalJSON does not handle extra unknown fields with complex > values > > > Key: ARROW-16471 > URL: https://issues.apache.org/jira/browse/ARROW-16471 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Affects Versions: 7.0.0 >Reporter: Phillip LeBlanc >Assignee: Matthew Topol >Priority: Minor > Labels: pull-request-available > Original Estimate: 24h > Time Spent: 10m > Remaining Estimate: 23h 50m > > The fix for https://issues.apache.org/jira/browse/ARROW-16456 only included > support for simple unknown fields with a single value. > i.e. > {code:javascript} > {"region": "NY", "model": "3", "sales": 742.0, "extra": 1234} > {code} > However, nested objects or arrays are still not handled properly. > {code:javascript} > {"region": "NY", "model": "3", "sales": 742.0, "extra_array": [1234], > "extra_object": {"nested": ["deeply"]}} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16471) RecordBuilder UnmarshalJSON does not handle extra unknown fields with complex values
[ https://issues.apache.org/jira/browse/ARROW-16471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16471: --- Labels: pull-request-available (was: ) > RecordBuilder UnmarshalJSON does not handle extra unknown fields with complex > values > > > Key: ARROW-16471 > URL: https://issues.apache.org/jira/browse/ARROW-16471 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Affects Versions: 7.0.0 >Reporter: Phillip LeBlanc >Priority: Minor > Labels: pull-request-available > Original Estimate: 24h > Time Spent: 10m > Remaining Estimate: 23h 50m > > The fix for https://issues.apache.org/jira/browse/ARROW-16456 only included > support for simple unknown fields with a single value. > i.e. > {code:javascript} > {"region": "NY", "model": "3", "sales": 742.0, "extra": 1234} > {code} > However, nested objects or arrays are still not handled properly. > {code:javascript} > {"region": "NY", "model": "3", "sales": 742.0, "extra_array": [1234], > "extra_object": {"nested": ["deeply"]}} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
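The behaviour the fix needs, ignoring unknown fields even when their values are nested objects or arrays, is trivial in a tree-decoding model; the Go RecordBuilder works on a token stream, where each nested value must be skipped token by token. A Python analogue of the desired end result:

```python
import json

# Python's json module decodes the whole document up front, so ignoring
# unknown fields -- nested or not -- reduces to dropping keys afterwards.
# (In Go's token-stream decoder, the nested values from the report had to
# be skipped explicitly, which is what the linked fix addresses.)
known = {"region", "model", "sales"}
raw = ('{"region": "NY", "model": "3", "sales": 742.0, '
       '"extra_array": [1234], "extra_object": {"nested": ["deeply"]}}')
record = {k: v for k, v in json.loads(raw).items() if k in known}
print(record)  # {'region': 'NY', 'model': '3', 'sales': 742.0}
```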
[jira] [Commented] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
[ https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627216#comment-17627216 ] Arjan van der Velde commented on ARROW-17374: - I'm running into this issue while building arrow for R from a conda environment. The issue seems to have been introduced by https://issues.apache.org/jira/browse/ARROW-16999. Versions prior to that build fine. > [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND > -- > > Key: ARROW-17374 > URL: https://issues.apache.org/jira/browse/ARROW-17374 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 8.0.0, 8.0.1, 9.0.0 > Environment: Amazon Linux 2 (RHEL) - 5.10.102-99.473.amzn2.x86_64 >Reporter: Shane Brennan >Priority: Blocker > Attachments: build-images.out > > > I've been trying to install Arrow on an R notebook within AWS SageMaker. > SageMaker provides Jupyter-like notebooks, with each instance running Amazon > Linux 2 as its OS, itself based on RHEL. > Trying to install a few ways, e.g., using the standard binaries, using the > nightly builds, setting ARROW_WITH_SNAPPY to ON and LIBARROW_MINIMAL all > still result in the following error. 
> {noformat} > x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared > -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -Wl,-O2 -Wl,--sort-common > -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags > -Wl,--gc-sections -Wl,--allow-shlib-undefined > -Wl,-rpath,/home/ec2-user/anaconda3/envs/R/lib > -Wl,-rpath-link,/home/ec2-user/anaconda3/envs/R/lib > -L/home/ec2-user/anaconda3/envs/R/lib -o arrow.so RTasks.o altrep.o array.o > array_to_vector.o arraydata.o arrowExports.o bridge.o buffer.o chunkedarray.o > compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o > expression.o extension-impl.o feather.o field.o filesystem.o imports.o io.o > json.o memorypool.o message.o parquet.o r_to_arrow.o recordbatch.o > recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o > schema.o symbols.o table.o threadpool.o type_infer.o > -L/tmp/Rtmpuh87oc/R.INSTALL67114493a3de/arrow/libarrow/arrow-9.0.0.20220809/lib > -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies -lz > SNAPPY_LIB-NOTFOUND /home/ec2-user/anaconda3/envs/R/lib/libbz2.so -pthread > -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto > -lcurl -lssl -lcrypto -lcurl -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -lR > x86_64-conda-linux-gnu-c++: error: SNAPPY_LIB-NOTFOUND: No such file or > directory > make: *** [/home/ec2-user/anaconda3/envs/R/lib/R/share/make/shlib.mk:10: > arrow.so] Error 1{noformat} > Snappy is installed on the systems, and both shared object (.so) and cmake > files are there, where I've tried setting the system env variables Snappy_DIR > and Snappy_LIB to point at them, but to no avail. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18177) [Go] Implement Add/Sub for Temporal Types
[ https://issues.apache.org/jira/browse/ARROW-18177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-18177: --- Labels: pull-request-available (was: ) > [Go] Implement Add/Sub for Temporal Types > - > > Key: ARROW-18177 > URL: https://issues.apache.org/jira/browse/ARROW-18177 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17899) [Go] Add support for Decimal types in go/arrow/csv/reader.go
[ https://issues.apache.org/jira/browse/ARROW-17899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-17899. --- Fix Version/s: 11.0.0 Resolution: Fixed Issue resolved by pull request 14504 [https://github.com/apache/arrow/pull/14504] > [Go] Add support for Decimal types in go/arrow/csv/reader.go > > > Key: ARROW-17899 > URL: https://issues.apache.org/jira/browse/ARROW-17899 > Project: Apache Arrow > Issue Type: Improvement > Components: Go >Reporter: Mitchell Devenport >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18212) [C++] NumericBuilder::Reset() doesn't reset all members
[ https://issues.apache.org/jira/browse/ARROW-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-18212: --- Labels: pull-request-available (was: ) > [C++] NumericBuilder::Reset() doesn't reset all members > --- > > Key: ARROW-18212 > URL: https://issues.apache.org/jira/browse/ARROW-18212 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jin Shang >Assignee: Jin Shang >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-18211) [C++] NumericBuilder::Reset() doesn't reset all members
[ https://issues.apache.org/jira/browse/ARROW-18211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jin Shang resolved ARROW-18211. --- Resolution: Duplicate > [C++] NumericBuilder::Reset() doesn't reset all members > --- > > Key: ARROW-18211 > URL: https://issues.apache.org/jira/browse/ARROW-18211 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jin Shang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-18211) [C++] NumericBuilder::Reset() doesn't reset all members
[ https://issues.apache.org/jira/browse/ARROW-18211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot closed ARROW-18211. - > [C++] NumericBuilder::Reset() doesn't reset all members > --- > > Key: ARROW-18211 > URL: https://issues.apache.org/jira/browse/ARROW-18211 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jin Shang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-18185) [C++][Compute] Support KEEP_NULL option for compute::Filter
[ https://issues.apache.org/jira/browse/ARROW-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jin Shang reassigned ARROW-18185: - Assignee: Jin Shang > [C++][Compute] Support KEEP_NULL option for compute::Filter > --- > > Key: ARROW-18185 > URL: https://issues.apache.org/jira/browse/ARROW-18185 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Jin Shang >Assignee: Jin Shang >Priority: Minor > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > The current Filter implementation always drops the filtered values. In some > use cases, it's desirable for the output array to have the same size as the > input array. So I added a new option FilterOptions::KEEP_NULL where the > filtered values are kept as nulls. > For example, with input [1, 2, 3] and filter [true, false, true], the current > implementation will output [1, 3], and with the new option it will output [1, > null, 3]. > This option is simpler to implement since we only need to construct a new > validity bitmap and reuse the input buffers and child arrays, except for > dense union arrays, which don't have validity bitmaps. > It is also faster to filter with FilterOptions::KEEP_NULL according to the > benchmark results in most cases. So users can choose this option for better > performance when dropping filtered values is not required. -- This message was sent by Atlassian Jira (v8.20.10#820010)
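The two output shapes described in the ticket can be sketched in a few lines of Python (illustrating the semantics only, not the Arrow C++ kernel):

```python
def filter_drop(values, mask):
    """Default Filter behaviour: filtered-out values are removed."""
    return [v for v, keep in zip(values, mask) if keep]

def filter_keep_null(values, mask):
    """FilterOptions::KEEP_NULL behaviour: the output keeps the input
    length, with filtered-out positions set to null (None here)."""
    return [v if keep else None for v, keep in zip(values, mask)]
```

With input [1, 2, 3] and filter [true, false, true], these produce [1, 3] and [1, None, 3] respectively, matching the example in the ticket.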
[jira] [Assigned] (ARROW-18212) [C++] NumericBuilder::Reset() doesn't reset all members
[ https://issues.apache.org/jira/browse/ARROW-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jin Shang reassigned ARROW-18212: - Assignee: Jin Shang > [C++] NumericBuilder::Reset() doesn't reset all members > --- > > Key: ARROW-18212 > URL: https://issues.apache.org/jira/browse/ARROW-18212 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jin Shang >Assignee: Jin Shang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18211) [C++] NumericBuilder::Reset() doesn't reset all members
Jin Shang created ARROW-18211: - Summary: [C++] NumericBuilder::Reset() doesn't reset all members Key: ARROW-18211 URL: https://issues.apache.org/jira/browse/ARROW-18211 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Jin Shang -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18161) Ruby Arrow Tables can have buffers get GC'ed
[ https://issues.apache.org/jira/browse/ARROW-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Noah Horton updated ARROW-18161: Summary: Ruby Arrow Tables can have buffers get GC'ed (was: Reading Arrow table causes mutations) > Ruby Arrow Tables can have buffers get GC'ed > > > Key: ARROW-18161 > URL: https://issues.apache.org/jira/browse/ARROW-18161 > Project: Apache Arrow > Issue Type: Bug > Components: Ruby >Affects Versions: 9.0.0 > Environment: Ruby 3.1.2 >Reporter: Noah Horton >Assignee: Kouhei Sutou >Priority: Major > > Given an Arrow::Table with several columns "X" > > {code:ruby} > # Rails console outputs > 3.1.2 :107 > x.schema > => > # dates: date32[day] > expected_values: double> > 3.1.2 :108 > x.schema > => > # dates: date32[day] > expected_values: double> > 3.1.2 :109 > {code} > Note that the object and pointer have both changed values. > But the far bigger issue is that repeated reads from it will cause different > results: > {code:ruby} > 3.1.2 :097 > x[1][0] > => Sun, 22 Aug 2021 > 3.1.2 :098 > x[1][1] > => nil > 3.1.2 :099 > x[1][0] > => nil {code} > I have a lot of issues like this - when I have done these types of read > operations, I get the original table with the data in the columns all > shuffled around or deleted. > I do ingest the data slightly oddly in the first place as it comes in over > GRPC and I am using Arrow::Buffer to read it from the GRPC and then passing > that into Arrow::Table.load. But I would not expect that once it was in > Arrow::Table that I could do anything to permute it unintentionally. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18212) [C++] NumericBuilder::Reset() doesn't reset all members
Jin Shang created ARROW-18212: - Summary: [C++] NumericBuilder::Reset() doesn't reset all members Key: ARROW-18212 URL: https://issues.apache.org/jira/browse/ARROW-18212 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Jin Shang -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18161) Reading Arrow table causes mutations
[ https://issues.apache.org/jira/browse/ARROW-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Noah Horton updated ARROW-18161: Summary: Reading Arrow table causes mutations (was: Reading error table causes mutations) > Reading Arrow table causes mutations > > > Key: ARROW-18161 > URL: https://issues.apache.org/jira/browse/ARROW-18161 > Project: Apache Arrow > Issue Type: Bug > Components: Ruby >Affects Versions: 9.0.0 > Environment: Ruby 3.1.2 >Reporter: Noah Horton >Assignee: Kouhei Sutou >Priority: Major > > Given an Arrow::Table with several columns "X" > > {code:ruby} > # Rails console outputs > 3.1.2 :107 > x.schema > => > # dates: date32[day] > expected_values: double> > 3.1.2 :108 > x.schema > => > # dates: date32[day] > expected_values: double> > 3.1.2 :109 > {code} > Note that the object and pointer have both changed values. > But the far bigger issue is that repeated reads from it will cause different > results: > {code:ruby} > 3.1.2 :097 > x[1][0] > => Sun, 22 Aug 2021 > 3.1.2 :098 > x[1][1] > => nil > 3.1.2 :099 > x[1][0] > => nil {code} > I have a lot of issues like this - when I have done these types of read > operations, I get the original table with the data in the columns all > shuffled around or deleted. > I do ingest the data slightly oddly in the first place as it comes in over > GRPC and I am using Arrow::Buffer to read it from the GRPC and then passing > that into Arrow::Table.load. But I would not expect that once it was in > Arrow::Table that I could do anything to permute it unintentionally. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18161) Reading error table causes mutations
[ https://issues.apache.org/jira/browse/ARROW-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627180#comment-17627180 ] Noah Horton commented on ARROW-18161: - This appears to have worked for the workaround - thanks. Leaving the ticket as the deeper fix would help folks in the future. > Reading error table causes mutations > > > Key: ARROW-18161 > URL: https://issues.apache.org/jira/browse/ARROW-18161 > Project: Apache Arrow > Issue Type: Bug > Components: Ruby >Affects Versions: 9.0.0 > Environment: Ruby 3.1.2 >Reporter: Noah Horton >Assignee: Kouhei Sutou >Priority: Major > > Given an Arrow::Table with several columns "X" > > {code:ruby} > # Rails console outputs > 3.1.2 :107 > x.schema > => > # dates: date32[day] > expected_values: double> > 3.1.2 :108 > x.schema > => > # dates: date32[day] > expected_values: double> > 3.1.2 :109 > {code} > Note that the object and pointer have both changed values. > But the far bigger issue is that repeated reads from it will cause different > results: > {code:ruby} > 3.1.2 :097 > x[1][0] > => Sun, 22 Aug 2021 > 3.1.2 :098 > x[1][1] > => nil > 3.1.2 :099 > x[1][0] > => nil {code} > I have a lot of issues like this - when I have done these types of read > operations, I get the original table with the data in the columns all > shuffled around or deleted. > I do ingest the data slightly oddly in the first place as it comes in over > GRPC and I am using Arrow::Buffer to read it from the GRPC and then passing > that into Arrow::Table.load. But I would not expect that once it was in > Arrow::Table that I could do anything to permute it unintentionally. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18183) cpp-micro benchmarks are failing on mac arm machine
[ https://issues.apache.org/jira/browse/ARROW-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627172#comment-17627172 ] Weston Pace commented on ARROW-18183: - Thank you. I will look at this today. > cpp-micro benchmarks are failing on mac arm machine > --- > > Key: ARROW-18183 > URL: https://issues.apache.org/jira/browse/ARROW-18183 > Project: Apache Arrow > Issue Type: Bug > Components: Benchmarking, C++ >Reporter: Elena Henderson >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18207) Rubygems not updating in concert with majors
[ https://issues.apache.org/jira/browse/ARROW-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627145#comment-17627145 ] Noah Horton commented on ARROW-18207: - I want to really call out that I appreciate the work of the team, and don't like throwing on opinions when I am not doing the work. That said... ;) Effectively we get to choose between holding arrow releases on this stuff (unlikely), breaking windows deployments of arrow apps on ruby, or breaking docker deployments on ruby. I think Docker should win. I cannot imagine that many people are deploying arrow ruby apps on windows without virtualization to Linux. > Rubygems not updating in concert with majors > > > Key: ARROW-18207 > URL: https://issues.apache.org/jira/browse/ARROW-18207 > Project: Apache Arrow > Issue Type: Bug > Components: Ruby >Affects Versions: 10.0.0 >Reporter: Noah Horton >Assignee: Kouhei Sutou >Priority: Major > > 10.0.0 just released, meaning that all install scripts that use the > 'latest' tag are getting it. > Yet rubygems.org is still running with the 9.0.0 version a week after 10.0.0 > released. > The build scripts need to start updating rubygems.org automatically, or guide > users to a bundler config like > {code:ruby} > gem "red-arrow", github: "apache/arrow", glob: "ruby/red-arrow/*.gemspec", > require: "arrow", tag: 'apache-arrow-10.0.0' > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18202) [R] gsub does not work properly
[ https://issues.apache.org/jira/browse/ARROW-18202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627139#comment-17627139 ] Dewey Dunnington commented on ARROW-18202: -- Thank you for reporting! This does sound like it is invalid behaviour. A slightly more minimal reprex with Arrow 10.0.0. {code:R} library(arrow, warn.conflicts = FALSE) #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information. library(dplyr, warn.conflicts = FALSE) vals <- c("100", "1000", "200" , "3000" , "50", "500", "", "Not Range") record_batch(vals = vals) |> mutate(vals2 = gsub("^$", "0", vals)) |> collect() #> # A tibble: 8 × 2 #> valsvals2 #> #> 1 "100" "100" #> 2 "1000" "1000" #> 3 "200" "200" #> 4 "3000" "3000" #> 5 "50""50" #> 6 "500" "500" #> 7 "" "" #> 8 "Not Range" "Not Range" tibble::tibble(vals = vals) |> mutate(vals2 = gsub("^$", "0", vals)) |> collect() #> # A tibble: 8 × 2 #> valsvals2 #> #> 1 "100" 100 #> 2 "1000" 1000 #> 3 "200" 200 #> 4 "3000" 3000 #> 5 "50"50 #> 6 "500" 500 #> 7 "" 0 #> 8 "Not Range" Not Range {code} > [R] gsub does not work properly > --- > > Key: ARROW-18202 > URL: https://issues.apache.org/jira/browse/ARROW-18202 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Lorenzo Isella >Priority: Major > > Hello, > I think there is a problem with arrow 10.0 and R. I did not have this issue > with arrow 9.0. > Could you please have a look? > Many thanks > > {code:r} > library(tidyverse) > library(arrow) > ll <- c( "100", "1000", "200" , "3000" , "50" , > "500", "" , "Not Range") > df <- tibble(x=rep(ll, 1000), y=seq(8000)) > write_tsv(df, "data.tsv") > data <- open_dataset("data.tsv", format="tsv", > skip_rows=1, > schema=schema(x=string(), > y=double()) > ) > test <- data |> > collect() > ###I want to replace the "" with "0". 
I believe this worked with arrow 9.0 > df2 <- data |> > mutate(x=gsub("^$","0",x) ) |> > collect() > df2 ### now I did not modify the "" entries in x > #> # A tibble: 8,000 × 2 > #> x y > #> > #> 1 "100" 1 > #> 2 "1000" 2 > #> 3 "200" 3 > #> 4 "3000" 4 > #> 5 "50" 5 > #> 6 "500" 6 > #> 7 "" 7 > #> 8 "Not Range" 8 > #> 9 "100" 9 > #> 10 "1000" 10 > #> # … with 7,990 more rows > > df3 <- df |> > mutate(x=gsub("^$","0",x) ) > df3 ## and this is fine > #> # A tibble: 8,000 × 2 > #> x y > #> > #> 1 100 1 > #> 2 1000 2 > #> 3 200 3 > #> 4 3000 4 > #> 5 50 5 > #> 6 500 6 > #> 7 0 7 > #> 8 Not Range 8 > #> 9 100 9 > #> 10 1000 10 > #> # … with 7,990 more rows > ## How to fix this...I believe this issue did not arise with arrow 9.0. > sessionInfo() > #> R version 4.2.1 (2022-06-23) > #> Platform: x86_64-pc-linux-gnu (64-bit) > #> Running under: Debian GNU/Linux 11 (bullseye) > #> > #> Matrix products: default > #> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 > #> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0 > #> > #> locale: > #> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > #> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > #> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 > #> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C > #> [9] LC_ADDRESS=C LC_TELEPHONE=C > #> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > #> > #> attached base packages: > #> [1] stats graphics grDevices utils datasets methods base > #> > #> other attached packages: > #> [1] arrow_10.0.0 forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10 > #> [5] purrr_0.3.5 readr_2.1.3 tidyr_1.2.1 tibble_3.1.8 > #> [9] ggplot2_3.3.6 tidyverse_1.3.2 > #> > #> loaded via a namespace (and not attached): > #> [1] lubridate_1.8.0 assertthat_0.2.1 digest_0.6.30 > #> [4] utf8_1.2.2 R6_2.5.1 cellranger_1.1.0 > #> [7] backports_1.4.1 reprex_2.0.2 evaluate_0.17 > #> [10] httr_1.4.4 highr_0.9 pillar_1.8.1 > #> [13] rlang_1.0.6 googlesheets4_1.0.1 readxl_1.4.1 > #> [16] R.utils_2.12.1 R.oo_1.25.0 rmarkdown_2.17 > #> 
[19] styler_1.8.0 googledrive_2.0.0
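For reference, the behaviour base R's gsub() produces here (and that the Arrow kernel should match) is what any standard regex engine gives for the anchor-only pattern "^$": it matches only the empty string. A quick check with Python's re module as a neutral third implementation:

```python
import re

vals = ["100", "1000", "200", "3000", "50", "500", "", "Not Range"]

# "^$" matches only the empty string, so only the empty entry is replaced.
vals2 = [re.sub(r"^$", "0", v) for v in vals]
```

Only the empty entry becomes "0" and every other value is unchanged, matching the base R / tibble output above and diverging from what Arrow 10.0.0 returns.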
[jira] [Commented] (ARROW-18204) [R] Allow setting field metadata
[ https://issues.apache.org/jira/browse/ARROW-18204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627138#comment-17627138 ] Dewey Dunnington commented on ARROW-18204: -- We should totally support this! A workaround in case you need it: {code:R} library(arrow, warn.conflicts = FALSE) #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information. # remotes::install_github("paleolimbt/narrow") library(narrow) set_field_metadata <- function(field, ...) { vals <- rlang::list2(...) cschema <- narrow::as_narrow_schema(field) current_vals <- cschema$metadata keys <- union(names(vals), names(current_vals)) cschema$metadata <- c(vals, current_vals)[keys] arrow::Field$import_from_c(cschema) } field_metadata <- function(field) { narrow::as_narrow_schema(field)$metadata } (f <- field("some name", int32())) #> Field #> some name: int32 f_meta <- set_field_metadata(f, some_key = "some value") field_metadata(f_meta) #> $some_key #> [1] "some value" {code} > [R] Allow setting field metadata > > > Key: ARROW-18204 > URL: https://issues.apache.org/jira/browse/ARROW-18204 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 10.0.0 >Reporter: Will Jones >Priority: Major > > Currently, can't create a {{Field}} with metadata, which makes it hard to > create tests regarding field metadata. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17731) [Website] Add blog post about Flight SQL JDBC driver
[ https://issues.apache.org/jira/browse/ARROW-17731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-17731. -- Resolution: Fixed Issue resolved by pull request 236 [https://github.com/apache/arrow-site/pull/236] > [Website] Add blog post about Flight SQL JDBC driver > > > Key: ARROW-17731 > URL: https://issues.apache.org/jira/browse/ARROW-17731 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC, Website >Reporter: David Li >Assignee: David Li >Priority: Major > Fix For: 11.0.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18183) cpp-micro benchmarks are failing on mac arm machine
[ https://issues.apache.org/jira/browse/ARROW-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627037#comment-17627037 ] Yibo Cai commented on ARROW-18183: -- I tried setting ARROW_DEFAULT_MEMORY_POOL to "jemalloc" and "system"; the same error happens. > cpp-micro benchmarks are failing on mac arm machine > --- > > Key: ARROW-18183 > URL: https://issues.apache.org/jira/browse/ARROW-18183 > Project: Apache Arrow > Issue Type: Bug > Components: Benchmarking, C++ >Reporter: Elena Henderson >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18183) cpp-micro benchmarks are failing on mac arm machine
[ https://issues.apache.org/jira/browse/ARROW-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627030#comment-17627030 ] Yibo Cai commented on ARROW-18183: -- Tested on M1, all "arrow-dataset-scanner-benchmark/scan_alg=1" tests failed with SIGBUS. "scan_alg=0" tests are okay. Stack depth is approaching 4000 from the backtrace. Looks like there is a call loop among \{future,async_util\}.\{h,cc\}. ASAN identified stack overflow, logs attached. cc [~westonpace] {code:bash} # all scan_alg:0 tests are okay, all scan_alg:1 tests cause sigbus % debug/arrow-dataset-scanner-benchmark --benchmark_filter=".*scan_alg:1.*" /Users/linux/cyb/arrow/cpp/src/arrow/memory_pool.cc:113: Unsupported backend 'mimalloc' specified in ARROW_DEFAULT_MEMORY_POOL (supported backends are 'jemalloc', 'system') Unable to determine clock rate from sysctl: hw.cpufrequency: No such file or directory This does not affect benchmark measurements, only the metadata output. 2022-11-01T17:02:15+08:00 Running debug/arrow-dataset-scanner-benchmark Run on (8 X 24.2408 MHz CPU s) CPU Caches: L1 Data 64 KiB L1 Instruction 128 KiB L2 Unified 4096 KiB (x8) Load Average: 2.06, 2.81, 2.72 AddressSanitizer:DEADLYSIGNAL = ==75674==ERROR: AddressSanitizer: stack-overflow on address 0x00016b9b3fc0 (pc 0x000106b4b3b4 bp 0x000106b4b3a0 sp 0x00016b9b3fa0 T1) #0 0x106b4b3b4 in __sanitizer::StackDepotBase<__sanitizer::StackDepotNode, 1, 20>::Put(__sanitizer::StackTrace, bool*)+0x4 (libclang_rt.asan_osx_dynamic.dylib:arm64+0x5f3b4) SUMMARY: AddressSanitizer: stack-overflow (libclang_rt.asan_osx_dynamic.dylib:arm64+0x5f3b4) in __sanitizer::StackDepotBase<__sanitizer::StackDepotNode, 1, 20>::Put(__sanitizer::StackTrace, bool*)+0x4 Thread T1 created by T0 here: #0 0x106b2680c in wrap_pthread_create+0x50 (libclang_rt.asan_osx_dynamic.dylib:arm64+0x3a80c) #1 0x113b7a408 in std::__1::__libcpp_thread_create(_opaque_pthread_t**, void* (*)(void*), void*) __threading_support:375 #2 0x113b7a128 in
std::__1::thread::thread(arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3&&) thread:309 #3 0x113b67e94 in std::__1::thread::thread(arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3&&) thread:301 #4 0x113b66794 in arrow::internal::ThreadPool::LaunchWorkersUnlocked(int) thread_pool.cc:412 #5 0x113b68444 in arrow::internal::ThreadPool::SpawnReal(arrow::internal::TaskHints, arrow::internal::FnOnce, arrow::StopToken, arrow::internal::FnOnce&&) thread_pool.cc:448 #6 0x10488dfd8 in arrow::Result > > > > arrow::internal::Executor::Submit > > > >(arrow::internal::TaskHints, arrow::StopToken, arrow::dataset::(anonymous namespace)::GetFragments(arrow::dataset::Dataset*, arrow::compute::Expression)::$_0&&) thread_pool.h:167 #7 0x10488be74 in arrow::Result > > > > arrow::internal::Executor::Submit > > > >(arrow::dataset::(anonymous namespace)::GetFragments(arrow::dataset::Dataset*, arrow::compute::Expression)::$_0&&) thread_pool.h:193 #8 0x10488ac0c in arrow::dataset::(anonymous namespace)::GetFragments(arrow::dataset::Dataset*, arrow::compute::Expression) scan_node.cc:64 #9 0x10488a010 in arrow::dataset::(anonymous namespace)::ScanNode::StartProducing() scan_node.cc:318 #10 0x113fc43e0 in arrow::compute::(anonymous namespace)::ExecPlanImpl::StartProducing() exec_plan.cc:183 #11 0x113fc362c in arrow::compute::ExecPlan::StartProducing() exec_plan.cc:400 #12 0x104462260 in arrow::dataset::MinimalEndToEndScan(unsigned long, unsigned long, std::__1::basic_string, std::__1::allocator > const&, std::__1::function > (unsigned long, unsigned long)>) scanner_benchmark.cc:159 #13 0x104468ebc in arrow::dataset::MinimalEndToEndBench(benchmark::State&) scanner_benchmark.cc:272 #14 0x1055dbc8c in benchmark::internal::BenchmarkInstance::Run(long long, int, benchmark::internal::ThreadTimer*, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*) const+0x44 (libbenchmark.1.7.0.dylib:arm64+0xbc8c) #15 0x1055ed708 in 
benchmark::internal::(anonymous namespace)::RunInThread(benchmark::internal::BenchmarkInstance const*, long long, int, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*)+0x58 (libbenchmark.1.7.0.dylib:arm64+0x1d708) #16 0x1055ed2c8 in benchmark::internal::BenchmarkRunner::DoNIterations()+0x2c0 (libbenchmark.1.7.0.dylib:arm64+0x1d2c8) #17 0x1055edfec in benchmark::internal::BenchmarkRunner::DoOneRepetition()+0xb0 (libbenchmark.1.7.0.dylib:arm64+0x1dfec) #18 0x1055d4fb8 in benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*, std::__1::basic_string, std::__1::allocator >)+0x9f0 (libbenchmark.1.7.0.dylib:arm64+0x4fb8) #19 0x1055d4564 in benchmark::RunSpecifiedBenchmarks()+0x3c
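The backtrace pattern above (thousands of frames cycling through future/async_util) is characteristic of futures whose continuations run synchronously on the completer's stack. A hedged Python sketch of that failure mode (illustrative only; this is not Arrow's future implementation, and `Future`/`run_chain` are made-up names):

```python
import sys

class Future:
    """Toy future whose callbacks run synchronously when it completes."""
    def __init__(self):
        self._callbacks = []

    def add_done_callback(self, cb):
        self._callbacks.append(cb)

    def complete(self):
        for cb in self._callbacks:
            cb()  # runs on the completer's stack: each hop adds frames

def run_chain(steps):
    f = Future()
    if steps:
        # Each step schedules the next from inside its callback, so a
        # chain of N steps consumes O(N) stack instead of O(1).
        f.add_done_callback(lambda: run_chain(steps - 1))
    f.complete()

sys.setrecursionlimit(500)  # a long chain blows through any fixed limit
try:
    run_chain(100_000)
    overflowed = False
except RecursionError:
    overflowed = True
```

A common remedy is a trampoline: detect re-entrant completion and queue the continuation to run after the current callback returns instead of invoking it inline.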
[jira] [Resolved] (ARROW-18162) [C++] Add Arm SVE compiler options
[ https://issues.apache.org/jira/browse/ARROW-18162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibo Cai resolved ARROW-18162. -- Fix Version/s: 11.0.0 Resolution: Fixed Issue resolved by pull request 14515 [https://github.com/apache/arrow/pull/14515] > [C++] Add Arm SVE compiler options > -- > > Key: ARROW-18162 > URL: https://issues.apache.org/jira/browse/ARROW-18162 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yibo Cai >Assignee: Yibo Cai >Priority: Major > Labels: pull-request-available > Fix For: 11.0.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > {{xsimd}} 9.0+ supports Arm SVE (fixed size). Some additional compiler > options are required to enable SVE. > Per my test on Amazon Graviton3 (SVE-256), SVE256 performs much better than > NEON for some cases. E.g., utf8 benchmark {{ValidateLargeAscii}} improves > from *38.6* (NEON) to *51.5* (SVE256) GB/s. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-18210) [C++][Parquet] Skip check in StreamWriter
[ https://issues.apache.org/jira/browse/ARROW-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627001#comment-17627001 ] Antoine Pitrou edited comment on ARROW-18210 at 11/1/22 8:21 AM: - I see. I don't think you can expect excellent performance from {{{}StreamWriter{}}}. Parquet is a columnar format, so you should feed the data column-wise rather than row-wise. Take a look at the {{TypedColumnWriter}} class and ensure you write data in batches. was (Author: pitrou): I see. I don't think you can expect excellent performance from StreamWriter. Parquet is a columnar format, so you should feed the data column-wise rather than row-wise. Take a look at the {{TypedColumnWriter}} and ensure you write data in batches. > [C++][Parquet] Skip check in StreamWriter > - > > Key: ARROW-18210 > URL: https://issues.apache.org/jira/browse/ARROW-18210 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Affects Versions: 10.0.0 >Reporter: Madhur >Priority: Major > > Currently StreamWriter is slower only because of checking of columns, if we > allow customization option (maybe ctor arg) to skip the check then > StreamWriter can be more efficient? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18210) [C++][Parquet] Skip check in StreamWriter
[ https://issues.apache.org/jira/browse/ARROW-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627001#comment-17627001 ] Antoine Pitrou commented on ARROW-18210: I see. I don't think you can expect excellent performance from StreamWriter. Parquet is a columnar format, so you should feed the data column-wise rather than row-wise. Take a look at the {{TypedColumnWriter}} and ensure you write data in batches. > [C++][Parquet] Skip check in StreamWriter > - > > Key: ARROW-18210 > URL: https://issues.apache.org/jira/browse/ARROW-18210 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Affects Versions: 10.0.0 >Reporter: Madhur >Priority: Major > > Currently StreamWriter is slower only because of checking of columns, if we > allow customization option (maybe ctor arg) to skip the check then > StreamWriter can be more efficient? -- This message was sent by Atlassian Jira (v8.20.10#820010)
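To make the suggested contrast concrete, here is a rough Python sketch (names are illustrative, not the parquet-cpp API) of reorganizing row-wise data into per-column batches, which is the shape a TypedColumnWriter-style WriteBatch call wants, one call per batch instead of one dispatch per cell:

```python
rows = [(1, "a", 0.5), (2, "b", 1.5), (3, "c", 2.5)]

def column_batches(rows, batch_size=1024):
    """Transpose row tuples into per-column slices so each column can be
    written with one call per batch rather than one call per cell."""
    for col_index, column in enumerate(zip(*rows)):
        for start in range(0, len(column), batch_size):
            yield col_index, column[start:start + batch_size]

batches = list(column_batches(rows, batch_size=2))
```

The per-value type checks a row-wise stream writer performs are what this batching amortizes: validation happens once per column batch, not once per cell.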
[jira] [Updated] (ARROW-18210) [C++][Parquet] Skip check in StreamWriter
[ https://issues.apache.org/jira/browse/ARROW-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-18210: --- Summary: [C++][Parquet] Skip check in StreamWriter (was: Skip check in StreamWriter) > [C++][Parquet] Skip check in StreamWriter > - > > Key: ARROW-18210 > URL: https://issues.apache.org/jira/browse/ARROW-18210 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 10.0.0 >Reporter: Madhur >Priority: Major > > Currently StreamWriter is slower only because of checking of columns, if we > allow customization option (maybe ctor arg) to skip the check then > StreamWriter can be more efficient? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18210) [C++][Parquet] Skip check in StreamWriter
[ https://issues.apache.org/jira/browse/ARROW-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-18210: --- Component/s: Parquet > [C++][Parquet] Skip check in StreamWriter > - > > Key: ARROW-18210 > URL: https://issues.apache.org/jira/browse/ARROW-18210 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Affects Versions: 10.0.0 >Reporter: Madhur >Priority: Major > > Currently StreamWriter is slower only because of checking of columns, if we > allow customization option (maybe ctor arg) to skip the check then > StreamWriter can be more efficient? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18210) Skip check in StreamWriter
[ https://issues.apache.org/jira/browse/ARROW-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627000#comment-17627000 ] Madhur commented on ARROW-18210: Yes https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet12StreamWriterE > Skip check in StreamWriter > -- > > Key: ARROW-18210 > URL: https://issues.apache.org/jira/browse/ARROW-18210 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 10.0.0 >Reporter: Madhur >Priority: Major > > Currently StreamWriter is slower only because of checking of columns, if we > allow customization option (maybe ctor arg) to skip the check then > StreamWriter can be more efficient? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18210) Skip check in StreamWriter
[ https://issues.apache.org/jira/browse/ARROW-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17626998#comment-17626998 ] Antoine Pitrou commented on ARROW-18210: Your issue description is not clear: is it about the Parquet StreamWriter? > Skip check in StreamWriter > -- > > Key: ARROW-18210 > URL: https://issues.apache.org/jira/browse/ARROW-18210 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 10.0.0 >Reporter: Madhur >Priority: Major > > Currently StreamWriter is slower only because of checking of columns, if we > allow customization option (maybe ctor arg) to skip the check then > StreamWriter can be more efficient? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17239) [C++] Calculate output type from aggregate to convert arrow aggregate to substrait
[ https://issues.apache.org/jira/browse/ARROW-17239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vibhatha Lakmal Abeykoon reassigned ARROW-17239: Assignee: Vibhatha Lakmal Abeykoon > [C++] Calculate output type from aggregate to convert arrow aggregate to > substrait > -- > > Key: ARROW-17239 > URL: https://issues.apache.org/jira/browse/ARROW-17239 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: substrait > Fix For: 11.0.0 > > > I am adding support for mapping to/from Arrow aggregates and Substrait > aggregates in ARROW-15582. However, the Arrow-to-Substrait direction is > currently blocked because the Substrait plan needs to know the output type of > an aggregate, and there is no easy way to determine that from the Arrow > information we have. > We should be able to get this information from the function registry, but as > far as I can tell the conversion routines do not have access to it. I'm not > sure whether the better solution is to pass the function registry > into ToProto or to add the output type to the aggregate in Arrow. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18205) [C++] Substrait consumer is not converting right side references correctly on joins
[ https://issues.apache.org/jira/browse/ARROW-18205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-18205: --- Labels: pull-request-available (was: ) > [C++] Substrait consumer is not converting right side references correctly on > joins > --- > > Key: ARROW-18205 > URL: https://issues.apache.org/jira/browse/ARROW-18205 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The Substrait plan expresses a join condition as a logical expression like: > {{field(0) == field(3)}} where {{0}} and {{3}} are indices into the > *combined* schema. These are then passed down to Acero, which expects: > {{HashJoinNodeOptions(std::vector<FieldRef> in_left_keys, > std::vector<FieldRef> in_right_keys)}} > However, {{in_left_keys}} are field references into the *left* schema and > {{in_right_keys}} are field references into the *right* schema. > In other words, given the above expression ({{field(0) == field(3)}}), if the > schema were: > left: > key: int32 > y: int32 > z: int32 > right: > key: int32 > x: int32 > Then {{in_left_keys}} should be {{field(0)}} (works correctly today) and > {{in_right_keys}} should be {{field(0)}} (today we are sending in > {{field(3)}}). -- This message was sent by Atlassian Jira (v8.20.10#820010)
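The conversion the ticket implies is a simple index remap. Assuming the combined join schema lays out the left fields first and then the right fields (as in the ticket's example), a combined-schema reference maps into the per-side references that `HashJoinNodeOptions` expects by subtracting the left field count. The helper below is illustrative only, not Acero's API; the names are hypothetical.

```cpp
#include <cstddef>

// Hypothetical remap: a field index into the combined join schema
// [left fields..., right fields...] becomes a per-side reference.
// Indices below left_size refer to the left schema unchanged; the
// rest shift down by left_size into the right schema.
struct SideRef {
  bool is_right;
  std::size_t index;
};

SideRef RemapCombinedIndex(std::size_t combined, std::size_t left_size) {
  if (combined < left_size) {
    return {false, combined};
  }
  return {true, combined - left_size};
}
```

With the ticket's schemas (left: key, y, z; right: key, x), `field(3)` in the combined schema remaps to `field(0)` of the right schema, which is exactly the `in_right_keys` value the bug report says should be sent.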
[jira] [Comment Edited] (ARROW-18148) [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather
[ https://issues.apache.org/jira/browse/ARROW-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17626980#comment-17626980 ] Danielle Navarro edited comment on ARROW-18148 at 11/1/22 6:56 AM: --- Tentatively offering some thoughts :-) If I'm understanding this properly, we have two problems: - The first problem is that the history of serializing Arrow objects is messy and has left us with three names that people might recognize: Feather, IPC, Arrow. We'd like users to transition to using "Arrow" as the preferred name, and to give them an API that reflects that terminology. - The second problem is that we use "file format" and "stream format" to mean something subtly different from "files" and "streams". The file format wraps the stream format with magic numbers at the start and end, with a footer written after the stream. These two formats aren't *inherently* tied to files and streams. The user can write a "stream formatted" file if they want (i.e., no magic numbers, no footer) and they can also send a "file formatted" serialization (i.e., with the magic numbers and footer) to an output stream if they want to. The current API allows this, but users would be forgiven for missing this subtle detail! h2. Option 1: Don't change the API, only the docs This option would leave `read_ipc_file()`, `write_ipc_file()`, `read_ipc_stream()`, and `write_ipc_stream()` as the four user-facing functions (treating `read_feather()` and `write_feather()` as soft-deprecated, and leaving `write_to_raw()` untouched). The only thing that would change in this version is that we would consistently refer to "Arrow IPC file" and "Arrow IPC stream" everywhere (i.e., never truncating it to "IPC"). Language around "feather" would be relegated to a secondary position (e.g., "formerly known as Feather"), and we would emphasize that the preferred file extension for V2 feather files is `.arrow`. h2. 
Option 2: New names for the existing four functions This option would replace `read_ipc_file()` with `read_arrow_file()`, `read_ipc_stream()` with `read_arrow_stream()`, and so on. The `ipc` and `feather` versions would be soft-deprecated. The documentation would be updated accordingly. We'd now refer to "Arrow file" and "Arrow stream" everywhere. As with option 1 we'd use language like "formerly known as Feather" to explain the history (perhaps linking back to the old repo just to highlight the origin). We would also, where relevant, note that "Arrow stream" is a conventional name for the "Arrow inter-process communication (IPC) streaming format", as a way of (a) explaining the ipc versions of the functions, and (b) helping users find the relevant part of the Arrow specification. h2. Option 3: Reduce API to two functions This option would have only two functions, `read_arrow()` and `write_arrow()`. Both functions would have a new argument called `format` (or something similar). Users could specify either `format = "stream"` or `format = "file"`. From a documentation perspective this would require a little more finessing: we might have to separate the help topics for the new API and the older versions of the API to avoid a mess. But it might have the advantage of making clearer to users that the terms `"stream"` and `"file"` don't actually refer to *where* you're writing the data, but how you *encode* the data when you write it. h2. Preferences? I am not sure what I prefer, but I can at least say what I think the strengths and weaknesses are for each proposal: Option 3 seems like the cleanest in terms of making the Arrow/Feather/IPC functions feel analogous to the other functions in the read/write API: `read_arrow()` and `write_arrow()` feel closely aligned with `read_parquet()` and `write_parquet()`. It makes very clear that these functions are designed to read and write Arrow objects in an "Arrow-like" way. 
However, it does have the disadvantage that the encoding-versus-destination complexity gets pushed into the arguments: users will need to understand why there is a `format` argument that is distinct from the `file`/`sink` argument, and the documentation will need to explain that. Option 2 has the advantage of preserving the same "four-function structure" as the existing serialization API, but it comes at the expense of being a little misleading to anyone who doesn't understand that the function names refer to the encoding, not the destination: `write_arrow_stream()` can in fact write to a file, and `write_arrow_file()` can write to a stream. That's potentially even more confusing. Option 1 has the advantage of not confusing existing users. The API doesn't change, and the documentation becomes slightly more informative. The disadvantage is that it leaves new users a bit confused about what the heck an "IPC" is, which means the documentation will have to carry the load.
[jira] [Created] (ARROW-18210) Skip check in StreamWriter
Madhur created ARROW-18210: -- Summary: Skip check in StreamWriter Key: ARROW-18210 URL: https://issues.apache.org/jira/browse/ARROW-18210 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 10.0.0 Reporter: Madhur Currently StreamWriter is slower only because of its per-column checks. If we allowed a customization option (perhaps a constructor argument) to skip those checks, could StreamWriter be more efficient? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-18209) [Java] Make ComplexCopier agnostic of specific implementation of MapWriter, i.e UnionMapWriter
[ https://issues.apache.org/jira/browse/ARROW-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-18209: --- Labels: pull-request-available (was: ) > [Java] Make ComplexCopier agnostic of specific implementation of MapWriter, > i.e UnionMapWriter > -- > > Key: ARROW-18209 > URL: https://issues.apache.org/jira/browse/ARROW-18209 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Vivek Shankar >Assignee: Vivek Shankar >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Making ComplexCopier independent of UnionMapWriter lets us use different > implementations, such as PromotableWriter. This lets us copy a map > vector whose values are themselves maps. Otherwise we get the following error: > {code:java} > ClassCastException: class org.apache.arrow.vector.complex.impl.PromotableWriter cannot > be cast to class org.apache.arrow.vector.complex.impl.UnionMapWriter{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)