[jira] [Updated] (ARROW-17356) [R] Update binding for add_filename() NSE function to error if used on Table

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17356:
-
Fix Version/s: 10.0.0

> [R] Update binding for add_filename() NSE function to error if used on Table
> 
>
> Key: ARROW-17356
> URL: https://issues.apache.org/jira/browse/ARROW-17356
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
> Fix For: 10.0.0
>
>
> ARROW-15260 adds a function which allows the user to add the filename as an 
> output field.  This function only makes sense to use with datasets and not 
> tables.  Currently, the error generated from using it with a table is handled 
> by {{handle_augmented_field_misuse()}}.  Instead, we should follow [one of 
> the suggestions from the 
> PR|https://github.com/apache/arrow/pull/12826#issuecomment-1192007298] to 
> detect this when the function is called.
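One way to implement the suggestion is to raise the error eagerly when the binding is invoked. A minimal sketch, where register_binding() and query_context() are hypothetical stand-ins for whatever the arrow package's real internals provide:

```r
# Sketch only: register_binding() and query_context() are stand-in names,
# not the actual arrow package API.
register_binding("add_filename", function() {
  if (!inherits(query_context()$source, "Dataset")) {
    # fail fast at call time instead of relying on
    # handle_augmented_field_misuse() to catch it later
    stop("add_filename() can only be used when querying a Dataset, not a Table")
  }
  # "__filename" is the augmented field exposed by Arrow datasets
  Expression$field_ref("__filename")
})
```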



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17371) [R] Remove as.factor to dictionary_encode mapping

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17371:
-
Fix Version/s: 10.0.0

> [R] Remove as.factor to dictionary_encode mapping
> -
>
> Key: ARROW-17371
> URL: https://issues.apache.org/jira/browse/ARROW-17371
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is an NSE func mapping between {{base::as.factor}} and Acero's 
> {{dictionary_encode}}.  However, it doesn't work at present - see 
> ARROW-12632.  Calling this function results in an error.  We should remove 
> this mapping so that an error is raised and we call {{as.factor}} in R not 
> Acero.





[jira] [Updated] (ARROW-17366) [R] Support purrr-style lambda functions in .fns argument to across()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17366:
-
Fix Version/s: 10.0.0

> [R] Support purrr-style lambda functions in .fns argument to across()
> -
>
> Key: ARROW-17366
> URL: https://issues.apache.org/jira/browse/ARROW-17366
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
> Fix For: 10.0.0
>
>
> ARROW-11699 adds support for dplyr::across inside a mutate(). The .fns 
> argument does not yet support purrr-style lambda functions (e.g. {{~round(.x,
> digits = -1)}}) but should be added.
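For reference, the target semantics are what dplyr already supports on plain data frames:

```r
library(dplyr)

# A purrr-style lambda in the .fns argument; the goal is for the same
# call to work unchanged on an Arrow Table or Dataset.
mtcars %>%
  mutate(across(c(disp, hp), ~ round(.x, digits = -1)))
```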





[jira] [Updated] (ARROW-17364) [R] Implement .names argument inside across()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17364:
-
Fix Version/s: 10.0.0

> [R] Implement .names argument inside across()
> -
>
> Key: ARROW-17364
> URL: https://issues.apache.org/jira/browse/ARROW-17364
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
> Fix For: 10.0.0
>
>
> ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}.  The 
> {{.names}} argument is not yet supported but should be added.
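The behaviour to match is dplyr's glue-style naming spec, e.g.:

```r
library(dplyr)

# .names controls output column names; "{.col}" expands to the input
# column name and "{.fn}" to the applied function's name.
mtcars %>%
  summarise(across(c(mpg, wt), mean, .names = "{.col}_avg"))
# produces columns mpg_avg and wt_avg
```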





[jira] [Updated] (ARROW-17362) [R] Implement dplyr::across() inside summarise()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17362:
-
Fix Version/s: 10.0.0

> [R] Implement dplyr::across() inside summarise()
> 
>
> Key: ARROW-17362
> URL: https://issues.apache.org/jira/browse/ARROW-17362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
> Fix For: 10.0.0
>
>
> ARROW-11699 adds the ability to call dplyr::across() inside dplyr::mutate().  
> Once this is merged, we should also add the ability to do so within 
> dplyr::summarise().





[jira] [Updated] (ARROW-17355) [R] Refactor the handle_* utility functions for a better dev experience

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17355:
-
Fix Version/s: 10.0.0

> [R] Refactor the handle_* utility functions for a better dev experience
> ---
>
> Key: ARROW-17355
> URL: https://issues.apache.org/jira/browse/ARROW-17355
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
> Fix For: 10.0.0
>
>
> In ARROW-15260, the utility functions for handling different kinds of reading 
> errors (handle_parquet_io_error, handle_csv_read_error, and 
> handle_augmented_field_misuse) were refactored so that multiple ones could be 
> chained together. An issue with this is that other errors may be silently
> swallowed if the handlers are used without a final step that re-raises any
> error they do not capture.  We should update the code to prevent this from
> being possible.
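A simplified sketch of the failure mode (not the actual arrow internals, which are structured differently):

```r
# Sketch only: handler and reader names below mirror the issue text but
# the real implementations differ. If each handler re-raises only the
# errors it recognises, an error matching none of them falls off the end
# of the chain and is silently swallowed unless the caller remembers a
# final stop(e).
tryCatch(
  do_read(),  # hypothetical reading step
  error = function(e) {
    handle_parquet_io_error(e)   # re-raises only if it matches
    handle_csv_read_error(e)     # re-raises only if it matches
    stop(e)  # easy to forget; the refactor should make this implicit
  }
)
```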





[jira] [Updated] (ARROW-17364) [R] Implement .names argument inside across()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17364:
-
Labels: good-second-issue  (was: )

> [R] Implement .names argument inside across()
> -
>
> Key: ARROW-17364
> URL: https://issues.apache.org/jira/browse/ARROW-17364
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: good-second-issue
> Fix For: 10.0.0
>
>
> ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}.  The 
> {{.names}} argument is not yet supported but should be added.





[jira] [Resolved] (ARROW-17334) [Java] Add maven-source-plugin to package source code

2022-08-11 Thread Jin Chengcheng (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jin Chengcheng resolved ARROW-17334.

Resolution: Fixed

[arrow/java_full_build.sh at acf25f764d6076b46f18bbbe2b04b0a5d970d448 · 
apache/arrow 
(github.com)|https://github.com/apache/arrow/blob/acf25f764d6076b46f18bbbe2b04b0a5d970d448/ci/scripts/java_full_build.sh#L50:L60]

Following this example, we can get the source jar.

> [Java] Add maven-source-plugin to package source code
> -
>
> Key: ARROW-17334
> URL: https://issues.apache.org/jira/browse/ARROW-17334
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 10.0.0
>Reporter: Jin Chengcheng
>Assignee: Jin Chengcheng
>Priority: Major
>  Labels: easyfix, pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> With the maven-source-plugin, we will get
> `arrow-vector-10.0.0-SNAPSHOT-sources.jar` in
> ~/.m2/repository/org/apache/arrow/arrow-vector/10.0.0-SNAPSHOT/.
> Then users can click Download Sources in IntelliJ to attach the real sources
> instead of browsing decompiled classes, so line numbers and code match.
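The standard way to attach a sources jar is the `jar-no-fork` goal of the maven-source-plugin; a typical configuration is below (the exact placement within Arrow's poms may differ):

```xml
<!-- Typical maven-source-plugin configuration: binds the sources jar
     to the package phase without forking the lifecycle. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-source-plugin</artifactId>
  <executions>
    <execution>
      <id>attach-sources</id>
      <goals>
        <goal>jar-no-fork</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```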





[jira] [Updated] (ARROW-17362) [R] Implement dplyr::across() inside summarise()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17362:
-
Labels: dplyr  (was: )

> [R] Implement dplyr::across() inside summarise()
> 
>
> Key: ARROW-17362
> URL: https://issues.apache.org/jira/browse/ARROW-17362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: dplyr
> Fix For: 10.0.0
>
>
> ARROW-11699 adds the ability to call dplyr::across() inside dplyr::mutate().  
> Once this is merged, we should also add the ability to do so within 
> dplyr::summarise().





[jira] [Updated] (ARROW-17361) [R] dplyr::summarize fails with division when divisor is a variable

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17361:
-
Summary: [R] dplyr::summarize fails with division when divisor is a 
variable  (was: dplyr::summarize fails with division when divisor is a variable)

> [R] dplyr::summarize fails with division when divisor is a variable
> ---
>
> Key: ARROW-17361
> URL: https://issues.apache.org/jira/browse/ARROW-17361
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Oliver Reiter
>Priority: Minor
>  Labels: aggregation, dplyr
>
> Hello,
> I found this odd behaviour when trying to compute an aggregate with 
> dplyr::summarize: When I want to use a pre-defined variable to do a division
> while aggregating, the execution fails with 'unsupported expression'. When I
> use the value of the variable directly in the aggregation, it works.
>  
> See below:
>  
> {code:java}
> library(dplyr)
> library(arrow)
> small_dataset <- tibble::tibble(
>   ## x = rep(c("a", "b"), each = 5),
>   y = rep(1:5, 2)
> )
> ## convert "small_dataset" into a ...dataset
> tmpdir <- tempfile()
> dir.create(tmpdir)
> write_dataset(small_dataset, tmpdir)
> ## works
> open_dataset(tmpdir) %>%
>   summarize(value = sum(y) / 10) %>%
>   collect()
> ## fails
> scale_factor <- 10
> open_dataset(tmpdir) %>%
>   summarize(value = sum(y) / scale_factor) %>%
>   collect()
> #> Fehler: Error in summarize_eval(names(exprs)[i],
> #> exprs[[i]], ctx, length(.data$group_by_vars) > :
> #   Expression sum(y)/scale_factor is not an aggregate
> #   expression or is not supported in Arrow
> # Call collect() first to pull data into R.
>    {code}
> I was not sure how to name this issue/bug (if it is one), so if there is a 
> clearer, more descriptive title you're welcome to adjust.
>  
> Thanks for your work!
>  
> Oliver
>  
> {code:java}
> > arrow_info()
> Arrow package version: 8.0.0
> Capabilities:
>                
> dataset    TRUE
> substrait FALSE
> parquet    TRUE
> json       TRUE
> s3         TRUE
> utf8proc   TRUE
> re2        TRUE
> snappy     TRUE
> gzip       TRUE
> brotli     TRUE
> zstd       TRUE
> lz4        TRUE
> lz4_frame  TRUE
> lzo       FALSE
> bz2        TRUE
> jemalloc   TRUE
> mimalloc   TRUE
> Memory:
>                   
> Allocator jemalloc
> Current   64 bytes
> Max       41.25 Kb
> Runtime:
>                         
> SIMD Level          avx2
> Detected SIMD Level avx2
> Build:
>                            
> C++ Library Version   8.0.0
> C++ Compiler            GNU
> C++ Compiler Version 12.1.0 {code}
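Until this is supported, injecting the value with `!!` so the translated expression contains a literal constant may sidestep the failure. This is an untested suggestion reusing `tmpdir` from the reprex above, not a confirmed fix:

```r
library(arrow)
library(dplyr)

# Possible workaround (unverified against Arrow): !! substitutes
# scale_factor as the literal 10 before the expression is translated.
scale_factor <- 10
open_dataset(tmpdir) %>%
  summarize(value = sum(y) / !!scale_factor) %>%
  collect()
```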





[jira] [Assigned] (ARROW-14742) [C++] Allow ParquetWriter to take a RecordBatchReader as input

2022-08-11 Thread Alvin Chunga Mamani (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alvin Chunga Mamani reassigned ARROW-14742:
---

Assignee: Alvin Chunga Mamani

> [C++] Allow ParquetWriter to take a RecordBatchReader as input
> --
>
> Key: ARROW-14742
> URL: https://issues.apache.org/jira/browse/ARROW-14742
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Nicola Crane
>Assignee: Alvin Chunga Mamani
>Priority: Critical
>
> Please could we extend the Parquet writer to take not only a Table or
> RecordBatch as input, but also a RecordBatchReader?  This would be 
> super-helpful for opening data as a dataset and writing to a single file.





[jira] [Updated] (ARROW-16965) [R] Update case_when() binding to match changes in dplyr

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-16965:
-
Labels: dplyr  (was: )

> [R] Update case_when() binding to match changes in dplyr
> 
>
> Key: ARROW-16965
> URL: https://issues.apache.org/jira/browse/ARROW-16965
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Ian Cook
>Priority: Major
>  Labels: dplyr
>
> [https://github.com/tidyverse/dplyr/pull/6300] introduced a change to the 
> {{case_when()}} API.
> Most importantly:
> {quote}There is a new {{.default}} argument that is intended to replace usage 
> of {{TRUE ~ default_value}}
> {quote}
> There are also new arguments {{.ptype}} and {{.size}}.
> There are a few other changes also, highlighted in the [dplyr NEWS.md 
> file|https://github.com/tidyverse/dplyr/blob/main/NEWS.md]
> Also see [https://twitter.com/dvaughan32/status/1542942862077317121].
> We should update the {{case_when()}} binding in the arrow R package to be 
> consistent with its new behaviors in dplyr, or to throw an error if the user 
> passes new arguments that we cannot handle consistently with dplyr.
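The new dplyr behaviour to mirror (dplyr >= 1.1.0):

```r
library(dplyr)

x <- c(-1, 0, 2)
# The old style used a TRUE ~ ... catch-all; .default replaces it:
case_when(x > 0 ~ "positive", .default = "non-positive")
```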





[jira] [Updated] (ARROW-14414) [R] Implement dplyr::bind_rows()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-14414:
-
Labels: dplyr  (was: )

> [R] Implement dplyr::bind_rows()
> 
>
> Key: ARROW-14414
> URL: https://issues.apache.org/jira/browse/ARROW-14414
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: dplyr
>
> {{pyarrow}} has {{concat_table()}}; this could be implemented via 
> ConcatenateTables (see https://issues.apache.org/jira/browse/ARROW-438)
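The desired usage once implemented, sketched with the arrow package's Table constructor (row-binding Arrow Tables is not yet supported at this point):

```r
library(arrow)
library(dplyr)

# Desired behaviour: bind_rows() concatenating Arrow Tables the way it
# already does for data frames.
t1 <- Table$create(x = 1:2)
t2 <- Table$create(x = 3L)
bind_rows(t1, t2)  # would concatenate into a 3-row Table
```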





[jira] [Updated] (ARROW-14415) [R] Implement dplyr::bind_cols()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-14415:
-
Labels: dplyr  (was: )

> [R] Implement dplyr::bind_cols()
> 
>
> Key: ARROW-14415
> URL: https://issues.apache.org/jira/browse/ARROW-14415
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: dplyr
>






[jira] [Updated] (ARROW-12137) [R] New/improved vignette on dplyr features

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-12137:
-
Labels: dplyr  (was: )

> [R] New/improved vignette on dplyr features
> ---
>
> Key: ARROW-12137
> URL: https://issues.apache.org/jira/browse/ARROW-12137
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: dplyr
>






[jira] [Updated] (ARROW-12778) [R] Support tidyselect where() selection helper in dplyr verbs

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-12778:
-
Labels: dplyr  (was: )

> [R] Support tidyselect where() selection helper in dplyr verbs
> --
>
> Key: ARROW-12778
> URL: https://issues.apache.org/jira/browse/ARROW-12778
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Priority: Major
>  Labels: dplyr
> Fix For: 10.0.0
>
>
> Since we can now determine the data type of an unevaluated array expression 
> (ARROW-12291) I think we should be able to support the {{where()}} selection 
> helper.
> This is already done for the {{relocate()}} verb (in ARROW-12781 ) but not 
> for any other verbs. 
> Steps required to do this:
>  # ARROW-12781 
>  # ARROW-12105
>  # Remove the {{check_select_helpers()}} function definition and remove all 
> the calls to it
>  # Modify any remaining {{expect_error()}} tests that test {{where()}} 
> and check for the error message {{"Unsupported selection helper"}}
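The target behaviour is tidyselect's `where()` working in any verb, as it does in dplyr on a data frame:

```r
library(dplyr)

# where() selects columns by a predicate on the column itself; this
# should work the same on ArrowTabular, Dataset, and arrow_dplyr_query.
mtcars %>%
  summarise(across(where(is.numeric), mean))
```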





[jira] [Updated] (ARROW-11699) [R] Implement dplyr::across() inside mutate()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-11699:
-
Labels: dplyr pull-request-available  (was: pull-request-available)

> [R] Implement dplyr::across() inside mutate()
> -
>
> Key: ARROW-11699
> URL: https://issues.apache.org/jira/browse/ARROW-11699
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Nicola Crane
>Priority: Major
>  Labels: dplyr, pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> It's not a generic, but because it seems only to be called inside functions 
> like `mutate()`, we can insert our own version of it into the NSE data mask.





[jira] [Updated] (ARROW-11755) [R] Add tests from dplyr/test-mutate.r

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-11755:
-
Labels: dplyr pull-request-available  (was: pull-request-available)

> [R] Add tests from dplyr/test-mutate.r
> --
>
> Key: ARROW-11755
> URL: https://issues.apache.org/jira/browse/ARROW-11755
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Minor
>  Labels: dplyr, pull-request-available
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Review 
> https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r 
> and port tests over to arrow as needed to see if there are edge cases we 
> aren't covering appropriately.





[jira] [Updated] (ARROW-14703) [R][Docs] Add docs on what dplyr + tidyverse functionality we support

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-14703:
-
Labels: dplyr  (was: )

> [R][Docs] Add docs on what dplyr + tidyverse functionality we support
> -
>
> Key: ARROW-14703
> URL: https://issues.apache.org/jira/browse/ARROW-14703
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: dplyr
>
> It's impossible to know what we do or do not support without literally trying 
> it out.  Other things to consider:
> * "I think it would be hella useful to have some kind of document giving a 
> view into the arrow package’s coverage of all things dplyr. This would be 
> great for users and also as a project management kind of tool"
> * "and you could print that in an rd page...look into what dbplyr does: it 
> has rd files for its dplyr methods 
> https://dbplyr.tidyverse.org/reference/index.html...it will be trickier for 
> us because we don't export them, so you'll have to do more @roclet fun and 
> probably pkgdown.yml work to make them show up"
> Perhaps this could work as a longer form vignette and a shorter form rd file 
> which prints out all the .rd functions?
>  





[jira] [Closed] (ARROW-16423) [R] arrow/dplyr: simple join and collect crashes session

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane closed ARROW-16423.

Resolution: Fixed

> [R] arrow/dplyr: simple join and collect crashes session
> 
>
> Key: ARROW-16423
> URL: https://issues.apache.org/jira/browse/ARROW-16423
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Andrew C Thomas
>Assignee: Will Jones
>Priority: Minor
>
> Trying to do an inner-join-style filter on an open_dataset() crashes R, 
> though not reliably on the first try; sometimes it takes a couple of 
> attempts before it does.
> Reprex follows.
> --
> library (arrow)
> library (dplyr)
> library (tidyr)
> DataSet <- expand_grid (A = 1:10, B = 1:10, C = 1:1) %>%
>   group_by (A, B)
> write_dataset(DataSet, "TestBreakData")
> for (DoThisUntilItBreaks in 1:100) {
>   message (DoThisUntilItBreaks)
>   D2 <- open_dataset("TestBreakData") %>% inner_join (data.frame (A=1L, 
> B=1:5)) %>% collect
> }





[jira] [Created] (ARROW-17384) [R] Additional dplyr functionality

2022-08-11 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17384:


 Summary: [R] Additional dplyr functionality
 Key: ARROW-17384
 URL: https://issues.apache.org/jira/browse/ARROW-17384
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


Umbrella ticket to collect together tickets relating to implementing additional 
dplyr verbs or unimplemented arguments for implemented verbs.





[jira] [Updated] (ARROW-17365) [R] Implement ... argument inside across()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17365:
-
Parent: ARROW-17384
Issue Type: Sub-task  (was: Improvement)

> [R] Implement ... argument inside across()
> --
>
> Key: ARROW-17365
> URL: https://issues.apache.org/jira/browse/ARROW-17365
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>
> ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}.  The 
> {{...}} argument is not yet supported but should be added.  There is a 
> failing test in the PR for ARROW-11699 which references this JIRA.





[jira] [Updated] (ARROW-17364) [R] Implement .names argument inside across()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17364:
-
Parent: ARROW-17384
Issue Type: Sub-task  (was: Improvement)

> [R] Implement .names argument inside across()
> -
>
> Key: ARROW-17364
> URL: https://issues.apache.org/jira/browse/ARROW-17364
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: good-second-issue
> Fix For: 10.0.0
>
>
> ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}.  The 
> {{.names}} argument is not yet supported but should be added.





[jira] [Updated] (ARROW-17366) [R] Support purrr-style lambda functions in .fns argument to across()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17366:
-
Parent: ARROW-17384
Issue Type: Sub-task  (was: Improvement)

> [R] Support purrr-style lambda functions in .fns argument to across()
> -
>
> Key: ARROW-17366
> URL: https://issues.apache.org/jira/browse/ARROW-17366
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
> Fix For: 10.0.0
>
>
> ARROW-11699 adds support for dplyr::across inside a mutate(). The .fns 
> argument does not yet support purrr-style lambda functions (e.g. {{~round(.x,
> digits = -1)}}) but should be added.





[jira] [Updated] (ARROW-17362) [R] Implement dplyr::across() inside summarise()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17362:
-
Parent: ARROW-17384
Issue Type: Sub-task  (was: Improvement)

> [R] Implement dplyr::across() inside summarise()
> 
>
> Key: ARROW-17362
> URL: https://issues.apache.org/jira/browse/ARROW-17362
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: dplyr
> Fix For: 10.0.0
>
>
> ARROW-11699 adds the ability to call dplyr::across() inside dplyr::mutate().  
> Once this is merged, we should also add the ability to do so within 
> dplyr::summarise().





[jira] [Updated] (ARROW-16965) [R] Update case_when() binding to match changes in dplyr

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-16965:
-
Parent: ARROW-17384
Issue Type: Sub-task  (was: Improvement)

> [R] Update case_when() binding to match changes in dplyr
> 
>
> Key: ARROW-16965
> URL: https://issues.apache.org/jira/browse/ARROW-16965
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Ian Cook
>Priority: Major
>  Labels: dplyr
>
> [https://github.com/tidyverse/dplyr/pull/6300] introduced a change to the 
> {{case_when()}} API.
> Most importantly:
> {quote}There is a new {{.default}} argument that is intended to replace usage 
> of {{TRUE ~ default_value}}
> {quote}
> There are also new arguments {{.ptype}} and {{.size}}.
> There are a few other changes also, highlighted in the [dplyr NEWS.md 
> file|https://github.com/tidyverse/dplyr/blob/main/NEWS.md]
> Also see [https://twitter.com/dvaughan32/status/1542942862077317121].
> We should update the {{case_when()}} binding in the arrow R package to be 
> consistent with its new behaviors in dplyr, or to throw an error if the user 
> passes new arguments that we cannot handle consistently with dplyr.





[jira] [Updated] (ARROW-14414) [R] Implement dplyr::bind_rows()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-14414:
-
Parent: ARROW-17384
Issue Type: Sub-task  (was: Improvement)

> [R] Implement dplyr::bind_rows()
> 
>
> Key: ARROW-14414
> URL: https://issues.apache.org/jira/browse/ARROW-14414
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: dplyr
>
> {{pyarrow}} has {{concat_table()}}; this could be implemented via 
> ConcatenateTables (see https://issues.apache.org/jira/browse/ARROW-438)





[jira] [Updated] (ARROW-14415) [R] Implement dplyr::bind_cols()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-14415:
-
Parent: ARROW-17384
Issue Type: Sub-task  (was: Improvement)

> [R] Implement dplyr::bind_cols()
> 
>
> Key: ARROW-14415
> URL: https://issues.apache.org/jira/browse/ARROW-14415
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: dplyr
>






[jira] [Updated] (ARROW-13766) [R] Add Arrow methods slice_min(), slice_max()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-13766:
-
Parent: ARROW-17384
Issue Type: Sub-task  (was: Improvement)

> [R] Add Arrow methods slice_min(), slice_max()
> --
>
> Key: ARROW-13766
> URL: https://issues.apache.org/jira/browse/ARROW-13766
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Ian Cook
>Priority: Major
>  Labels: query-engine
>
> Implement [{{slice_min()}} and 
> {{slice_max()}}|https://dplyr.tidyverse.org/reference/slice.html] methods for 
> {{ArrowTabular}}, {{Dataset}}, and {{arrow_dplyr_query}} objects.
> These dplyr functions supersede the older dplyr function 
> [{{top_n()}}|https://dplyr.tidyverse.org/reference/top_n.html] which I 
> suppose we should also consider implementing a method for.
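The dplyr semantics to replicate for Arrow objects, on a plain data frame:

```r
library(dplyr)

# Reference behaviour to match:
mtcars %>% slice_max(mpg, n = 3)  # the 3 rows with the highest mpg
mtcars %>% slice_min(wt, n = 2)   # the 2 rows with the lowest wt
```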





[jira] [Updated] (ARROW-13767) [R] Add Arrow methods slice(), slice_head(), slice_tail()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-13767:
-
Parent: ARROW-17384
Issue Type: Sub-task  (was: Improvement)

> [R] Add Arrow methods slice(), slice_head(), slice_tail()
> -
>
> Key: ARROW-13767
> URL: https://issues.apache.org/jira/browse/ARROW-13767
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Ian Cook
>Priority: Major
>  Labels: query-engine
>
> Implement [{{slice()}}, {{slice_head()}}, and 
> {{slice_tail()}}|https://dplyr.tidyverse.org/reference/slice.html] methods 
> for {{ArrowTabular}}, {{Dataset}}, and {{arrow_dplyr_query}} objects. I 
> believe this should be relatively straightforward, using {{Take()}} to return 
> only the specified rows. We already have a {{head()}} method which I believe 
> we can reuse for {{slice_head()}}.
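The reference behaviour on a data frame, which the Take()-based methods would reproduce:

```r
library(dplyr)

# Reference semantics; slice_head() maps naturally onto the existing
# head() method, and slice()/slice_tail() onto Take().
mtcars %>% slice(5:7)         # rows 5 through 7
mtcars %>% slice_head(n = 3)  # first 3 rows
mtcars %>% slice_tail(n = 3)  # last 3 rows
```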





[jira] [Commented] (ARROW-17373) [R] copying dataset and immediately writing the copy to a different location fails

2022-08-11 Thread Egill Axfjord Fridgeirsson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578358#comment-17578358
 ] 

Egill Axfjord Fridgeirsson commented on ARROW-17373:


After some further testing it seems the copying is unnecessary.

Opening a large dataset and writing to a different location seems to produce 
the error in most cases.

 

Here is a slightly simpler reprex:
{code:java}
df <- data.frame(replicate(1,sample(0:1,100e6,rep=TRUE)))
savePath <- file.path(tempdir(), 'arrowTest')
if (!dir.exists(savePath)) {
  dir.create(savePath)
}
arrow::write_feather(df, file.path(savePath, 'part-0.feather'))
writePath <- file.path(tempdir(), 'arrowTest')
if (!dir.exists(writePath)) {
  dir.create(writePath)
}
dataset <- arrow::open_dataset(savePath, format='feather')
arrow::write_dataset(dataset = dataset, path = writePath, format = 'feather')
{code}

> [R] copying dataset and immediately writing the copy to a different location 
> fails
> -
>
> Key: ARROW-17373
> URL: https://issues.apache.org/jira/browse/ARROW-17373
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
> Environment: Ubuntu 22.10
>Reporter: Egill Axfjord Fridgeirsson
>Priority: Major
>
> When I copy large feather files, open a dataset from that file and 
> immediately write that dataset to a new location I get the following error:
>  
> ```Error: Invalid: Expected to read 144 metadata bytes but got 0```
>  
> I have made a reproducible example below:
>  
> ``` r
> df <- data.frame(replicate(1,sample(0:1,100e6,rep=TRUE)))
> savePath <- file.path(tempdir(), 'arrowTest')
> if (!dir.exists(savePath)) {
>   dir.create(savePath)
> }
> arrow::write_feather(df, file.path(savePath, 'part-0.feather'))
> copyPath <- file.path(tempdir(),'arrowTest')
> if (!dir.exists(copyPath)) {
>   dir.create(copyPath)
> }
> writePath <- file.path(tempdir(), 'arrowTest')
> if (!dir.exists(writePath)) {
>   dir.create(writePath)
> }
> arrow::copy_files(savePath, copyPath)
> dataset <- arrow::open_dataset(copyPath, format='feather')
> arrow::write_dataset(dataset = dataset, path = writePath, format = 'feather')
> ```





[jira] [Commented] (ARROW-14999) [C++] List types with different field names are not equal

2022-08-11 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578368#comment-17578368
 ] 

Joris Van den Bossche commented on ARROW-14999:
---

bq. I am unable to determine any real use case for a custom field name

We are starting to use custom field names in the extension types for geospatial 
data 
(https://github.com/geopandas/geo-arrow-spec/blob/main/extension-types.md#concrete-examples-of-extension-type-metadata,
 https://github.com/paleolimbot/geoarrow/). 
I don't know if we currently already rely on those names, but we had the 
intention to do so in the future (cc [~paleolimbot])

> [C++] List types with different field names are not equal
> -
>
> Key: ARROW-14999
> URL: https://issues.apache.org/jira/browse/ARROW-14999
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 6.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When comparing map types, the names of the fields are ignored. This was 
> introduced in ARROW-7173.
> However for list types, they are not ignored. For example,
> {code:python}
> In [6]: l1 = pa.list_(pa.field("val", pa.int64()))
> In [7]: l2 = pa.list_(pa.int64())
> In [8]: l1
> Out[8]: ListType(list<val: int64>)
> In [9]: l2
> Out[9]: ListType(list<item: int64>)
> In [10]: l1 == l2
> Out[10]: False
> {code}
> Should we make list type comparison ignore field names too?





[jira] [Updated] (ARROW-16993) [C++] cmake: `cannot create imported target "Boost::headers"`

2022-08-11 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-16993:

Fix Version/s: 9.0.1

> [C++] cmake: `cannot create imported target "Boost::headers"`
> -
>
> Key: ARROW-16993
> URL: https://issues.apache.org/jira/browse/ARROW-16993
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jefferson Carpenter
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.1
>
> Attachments: CMakeError.log, CMakeOutput.log, cmake_stdout.txt
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Hi,
> I just tried to build arrow/cpp using cmake, and on the master branch I get 
> the error
>  
> {code:java}
> arrow-build $ cmake -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=DEBUG ../arrow/cpp/
> ...
> CMake Error at cmake_modules/ThirdpartyToolchain.cmake:873 (add_library):
>   add_library cannot create imported target "Boost::headers" because another
>   target with the same name already exists.
> Call Stack (most recent call first):
>   cmake_modules/ThirdpartyToolchain.cmake:139 (build_boost)
>   cmake_modules/ThirdpartyToolchain.cmake:236 (build_dependency)
>   cmake_modules/ThirdpartyToolchain.cmake:1014 (resolve_dependency)
>   CMakeLists.txt:552 (include) 
> ...
> -- Configuring incomplete, errors occurred!
> See also "/app/arrow-build/CMakeFiles/CMakeOutput.log".
> See also "/app/arrow-build/CMakeFiles/CMakeError.log".{code}
> and CMake exits with status 1.  The project configures successfully on the 
> apache-arrow-8.0.0 tag.  Running a git bisect, the defect was introduced in 
> the commit:
> {noformat}
> d653b71d79fc381c43f59d3095cc1c9fb0c1cf7c
> ARROW-16168: [C++][CMake] Use target to add include paths{noformat}
> I have attached CMakeOutput.log and CMakeError.log.
> Thanks!





[jira] [Updated] (ARROW-17252) [R] Intermittent valgrind failure

2022-08-11 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-17252:

Fix Version/s: 9.0.1

> [R] Intermittent valgrind failure
> -
>
> Key: ARROW-17252
> URL: https://issues.apache.org/jira/browse/ARROW-17252
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dewey Dunnington
>Assignee: Dewey Dunnington
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0, 9.0.1
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>
> A number of recent nightly builds have intermittent failures with valgrind, 
> which fails because of possibly leaked memory around an exec plan. This seems 
> related to a change in XXX that separated {{ExecPlan_prepare()}} from 
> {{ExecPlan_run()}} and added a {{ExecPlan_read_table()}} that uses 
> {{RunWithCapturedR()}}. The reported leaks vary but include ExecPlans and 
> ExecNodes and fields of those objects.
> A failed run: 
> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=30310&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=24980
> Some example output:
> {noformat}
> ==5249== 14,112 (384 direct, 13,728 indirect) bytes in 1 blocks are 
> definitely lost in loss record 1,988 of 3,883
> ==5249==at 0x4849013: operator new(unsigned long) (in 
> /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==5249==by 0x10B2902B: 
> std::_Function_handler 
> (arrow::compute::ExecPlan*, std::vector std::allocator >, arrow::compute::ExecNodeOptions 
> const&), 
> arrow::compute::internal::RegisterAggregateNode(arrow::compute::ExecFactoryRegistry*)::{lambda(arrow::compute::ExecPlan*,
>  std::vector std::allocator >, arrow::compute::ExecNodeOptions 
> const&)#1}>::_M_invoke(std::_Any_data const&, arrow::compute::ExecPlan*&&, 
> std::vector std::allocator >&&, 
> arrow::compute::ExecNodeOptions const&) (exec_plan.h:60)
> ==5249==by 0xFA83A0C: 
> std::function 
> (arrow::compute::ExecPlan*, std::vector std::allocator >, arrow::compute::ExecNodeOptions 
> const&)>::operator()(arrow::compute::ExecPlan*, 
> std::vector std::allocator >, arrow::compute::ExecNodeOptions 
> const&) const (std_function.h:622)
> ==5249== 14,528 (160 direct, 14,368 indirect) bytes in 1 blocks are 
> definitely lost in loss record 1,989 of 3,883
> ==5249==at 0x4849013: operator new(unsigned long) (in 
> /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==5249==by 0x10096CB7: arrow::FutureImpl::Make() (future.cc:187)
> ==5249==by 0xFCB6F9A: arrow::Future::Make() 
> (future.h:420)
> ==5249==by 0x101AE927: ExecPlanImpl (exec_plan.cc:50)
> ==5249==by 0x101AE927: 
> arrow::compute::ExecPlan::Make(arrow::compute::ExecContext*, 
> std::shared_ptr) (exec_plan.cc:355)
> ==5249==by 0xFA77BA2: ExecPlan_create(bool) (compute-exec.cpp:45)
> ==5249==by 0xF9FAE9F: _arrow_ExecPlan_create (arrowExports.cpp:868)
> ==5249==by 0x4953B60: R_doDotCall (dotcode.c:601)
> ==5249==by 0x49C2C16: bcEval (eval.c:7682)
> ==5249==by 0x499DB95: Rf_eval (eval.c:748)
> ==5249==by 0x49A0904: R_execClosure (eval.c:1918)
> ==5249==by 0x49A05B7: Rf_applyClosure (eval.c:1844)
> ==5249==by 0x49B2122: bcEval (eval.c:7094)
> ==5249== 
> ==5249== 36,322 (416 direct, 35,906 indirect) bytes in 1 blocks are 
> definitely lost in loss record 2,929 of 3,883
> ==5249==at 0x4849013: operator new(unsigned long) (in 
> /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==5249==by 0x10214F92: arrow::compute::TaskScheduler::Make() 
> (task_util.cc:421)
> ==5249==by 0x101AEA6C: ExecPlanImpl (exec_plan.cc:50)
> ==5249==by 0x101AEA6C: 
> arrow::compute::ExecPlan::Make(arrow::compute::ExecContext*, 
> std::shared_ptr) (exec_plan.cc:355)
> ==5249==by 0xFA77BA2: ExecPlan_create(bool) (compute-exec.cpp:45)
> ==5249==by 0xF9FAE9F: _arrow_ExecPlan_create (arrowExports.cpp:868)
> ==5249==by 0x4953B60: R_doDotCall (dotcode.c:601)
> ==5249==by 0x49C2C16: bcEval (eval.c:7682)
> ==5249==by 0x499DB95: Rf_eval (eval.c:748)
> ==5249==by 0x49A0904: R_execClosure (eval.c:1918)
> ==5249==by 0x49A05B7: Rf_applyClosure (eval.c:1844)
> ==5249==by 0x49B2122: bcEval (eval.c:7094)
> ==5249==by 0x499DB95: Rf_eval (eval.c:748)
> {noformat}
> We also occasionally get leaked Schemas, and in one case a leaked InputType 
> that seemed completely unrelated to the other leaks (ARROW-17225).
> I'm wondering if these have to do with references in lambdas that get passed 
> by reference? Or perhaps a cache issue? There were some instances in previous 
> leaks where the backtrace to the {{new}} allocator was different between 
> reported leaks.




[jira] [Created] (ARROW-17385) [Integration] Re-enable disabled Rust Flight middleware test

2022-08-11 Thread David Li (Jira)
David Li created ARROW-17385:


 Summary: [Integration] Re-enable disabled Rust Flight middleware 
test
 Key: ARROW-17385
 URL: https://issues.apache.org/jira/browse/ARROW-17385
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Integration
Reporter: David Li
Assignee: David Li


Follow-up for ARROW-10961. The linked Rust issue was fixed, so we should 
re-enable the integration test case.





[jira] [Created] (ARROW-17386) [R] strptime tests not robust across platforms

2022-08-11 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-17386:
---

 Summary: [R] strptime tests not robust across platforms
 Key: ARROW-17386
 URL: https://issues.apache.org/jira/browse/ARROW-17386
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Neal Richardson
 Fix For: 9.0.1


After the 9.0.0 release was accepted on CRAN, Ripley emailed me about a test 
failure on some other machine, which has not yet shown up on CRAN checks:

{code}
── Failure (test-dplyr-funcs-datetime.R:183:5): strptime ───
  `object` (`actual`) not equal to `expected` (`expected`).
  
  actual vs expected
  x
  - actual[1, ]  NA
  + expected[1, ]   1999-03-16 12:22:20
  - actual[2, ]  NA
  + expected[2, ]   1999-10-08 18:02:24
  - actual[3, ]  NA
  + expected[3, ]   1999-04-04 03:52:27
  - actual[4, ]  NA
  + expected[4, ]   1999-05-28 11:35:45
  - actual[5, ]  NA
  + expected[5, ]   1999-03-16 08:08:55
  - actual[6, ]  NA
  + expected[6, ]   1999-09-25 00:19:59
  - actual[7, ]  NA
  + expected[7, ]   1999-10-12 20:47:55
  - actual[8, ]  NA
  + expected[8, ]   1999-04-15 20:36:12
  - actual[9, ]  NA
  + expected[9, ]   1999-05-01 03:55:23
  - actual[10, ] NA
  + expected[10, ]  1999-12-15 01:19:05
  and 90 more ...
  
   actual$x | expected$x   
   [1] NA   - "1999-03-16 12:22:20" [1]
   [2] NA   - "1999-10-08 18:02:24" [2]
   [3] NA   - "1999-04-04 03:52:27" [3]
   [4] NA   - "1999-05-28 11:35:45" [4]
   [5] NA   - "1999-03-16 08:08:55" [5]
   [6] NA   - "1999-09-25 00:19:59" [6]
   [7] NA   - "1999-10-12 20:47:55" [7]
   [8] NA   - "1999-04-15 20:36:12" [8]
   [9] NA   - "1999-05-01 03:55:23" [9]
  [10] NA   - "1999-12-15 01:19:05" [10]   
   ... ......   and 90 more ...
  Backtrace:
  ▆
   1. └─arrow:::expect_equal(...) at test-dplyr-funcs-datetime.R:183:4
   2.   └─testthat::expect_equal(...) at 
tests/testthat/helper-expectation.R:42:4
  
  [ FAIL 1 | WARN 0 | SKIP 79 | PASS 8173 ]
{code}

It appears that one of the strptime tests returns NA in Arrow but not in R. 
Reading the test, it uses R to first strftime and then tests that Arrow and R 
both strptime that back, so it could be an R quirk: R recognizes and can do 
something with this strptime token round trip, but our library doesn't. 

Unfortunately, I don't know which token it is though because these tests are 
run in a for loop and the failure message doesn't say which token is the one 
that is failing. testthat does provide some facilities for reporting useful 
things within a loop, so we should wire those up. 

In addition to better handling of tests in a loop, we should probably just skip 
this whole thing on CRAN.
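
A sketch of the "report which token fails" idea, using Python's stdlib rather than the R test code: round-trip each format through strftime and back through strptime individually, so a failure names the offending format instead of showing a bare NA. The format list here is an illustrative subset, not the one used in the actual test.

```python
from datetime import datetime

ts = datetime(1999, 3, 16, 12, 22, 20)
formats = ["%Y-%m-%d %H:%M:%S", "%Y/%m/%d", "%d-%b-%Y"]  # illustrative subset

failures = []
for fmt in formats:
    formatted = ts.strftime(fmt)          # format first...
    try:
        datetime.strptime(formatted, fmt)  # ...then parse it back
    except ValueError:
        failures.append(fmt)               # record which format broke

# The assertion message now identifies the failing format directly.
assert not failures, f"round trip failed for: {failures}"
```

testthat's equivalent facility is passing an informative `label`/`info` to the expectation inside the loop.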





[jira] [Updated] (ARROW-17385) [Integration] Re-enable disabled Rust Flight middleware test

2022-08-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17385:
---
Labels: pull-request-available  (was: )

> [Integration] Re-enable disabled Rust Flight middleware test
> 
>
> Key: ARROW-17385
> URL: https://issues.apache.org/jira/browse/ARROW-17385
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Follow-up for ARROW-10961. The linked Rust issue was fixed, so we should 
> re-enable the integration test case.





[jira] [Commented] (ARROW-17386) [R] strptime tests not robust across platforms

2022-08-11 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578465#comment-17578465
 ] 

Neal Richardson commented on ARROW-17386:
-

We also might want to be explicit about which ones we expect to be all NA and 
which ones should be valid. My guess from reading the strptime docs is that on 
every other platform, this particular token is also invalid in R, but it is 
valid on BDR's machine.

> [R] strptime tests not robust across platforms
> --
>
> Key: ARROW-17386
> URL: https://issues.apache.org/jira/browse/ARROW-17386
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 9.0.1
>
>
> After the 9.0.0 release was accepted on CRAN, Ripley emailed me about a test 
> failure on some other machine, which has not yet shown up on CRAN checks:
> {code}
> ── Failure (test-dplyr-funcs-datetime.R:183:5): strptime 
> ───
>   `object` (`actual`) not equal to `expected` (`expected`).
>   
>   actual vs expected
>   x
>   - actual[1, ]  NA
>   + expected[1, ]   1999-03-16 12:22:20
>   - actual[2, ]  NA
>   + expected[2, ]   1999-10-08 18:02:24
>   - actual[3, ]  NA
>   + expected[3, ]   1999-04-04 03:52:27
>   - actual[4, ]  NA
>   + expected[4, ]   1999-05-28 11:35:45
>   - actual[5, ]  NA
>   + expected[5, ]   1999-03-16 08:08:55
>   - actual[6, ]  NA
>   + expected[6, ]   1999-09-25 00:19:59
>   - actual[7, ]  NA
>   + expected[7, ]   1999-10-12 20:47:55
>   - actual[8, ]  NA
>   + expected[8, ]   1999-04-15 20:36:12
>   - actual[9, ]  NA
>   + expected[9, ]   1999-05-01 03:55:23
>   - actual[10, ] NA
>   + expected[10, ]  1999-12-15 01:19:05
>   and 90 more ...
>   
>actual$x | expected$x   
>[1] NA   - "1999-03-16 12:22:20" [1]
>[2] NA   - "1999-10-08 18:02:24" [2]
>[3] NA   - "1999-04-04 03:52:27" [3]
>[4] NA   - "1999-05-28 11:35:45" [4]
>[5] NA   - "1999-03-16 08:08:55" [5]
>[6] NA   - "1999-09-25 00:19:59" [6]
>[7] NA   - "1999-10-12 20:47:55" [7]
>[8] NA   - "1999-04-15 20:36:12" [8]
>[9] NA   - "1999-05-01 03:55:23" [9]
>   [10] NA   - "1999-12-15 01:19:05" [10]   
>... ......   and 90 more ...
>   Backtrace:
>   ▆
>1. └─arrow:::expect_equal(...) at test-dplyr-funcs-datetime.R:183:4
>2.   └─testthat::expect_equal(...) at 
> tests/testthat/helper-expectation.R:42:4
>   
>   [ FAIL 1 | WARN 0 | SKIP 79 | PASS 8173 ]
> {code}
> It appears that one of the strptime tests returns NA in Arrow but not in R. 
> Reading the test, it uses R to first strftime and then tests that Arrow and R 
> both strptime that back, so it could be an R quirk: R recognizes and can do 
> something with this strptime token round trip, but our library doesn't. 
> Unfortunately, I don't know which token it is though because these tests are 
> run in a for loop and the failure message doesn't say which token is the one 
> that is failing. testthat does provide some facilities for reporting useful 
> things within a loop, so we should wire those up. 
> In addition to better handling of tests in a loop, we should probably just 
> skip this whole thing on CRAN.





[jira] [Commented] (ARROW-17386) [R] strptime tests not robust across platforms

2022-08-11 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578468#comment-17578468
 ] 

Rok Mihevc commented on ARROW-17386:


It's either strftime or strptime working in an unexpected way. We should 
rewrite the test to be more explicit and only test strptime (like we do for 
parse_date_time: 
https://github.com/apache/arrow/blob/master/r/tests/testthat/test-dplyr-funcs-datetime.R#L2382-L2400).

> [R] strptime tests not robust across platforms
> --
>
> Key: ARROW-17386
> URL: https://issues.apache.org/jira/browse/ARROW-17386
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 9.0.1
>
>
> After the 9.0.0 release was accepted on CRAN, Ripley emailed me about a test 
> failure on some other machine, which has not yet shown up on CRAN checks:
> {code}
> ── Failure (test-dplyr-funcs-datetime.R:183:5): strptime 
> ───
>   `object` (`actual`) not equal to `expected` (`expected`).
>   
>   actual vs expected
>   x
>   - actual[1, ]  NA
>   + expected[1, ]   1999-03-16 12:22:20
>   - actual[2, ]  NA
>   + expected[2, ]   1999-10-08 18:02:24
>   - actual[3, ]  NA
>   + expected[3, ]   1999-04-04 03:52:27
>   - actual[4, ]  NA
>   + expected[4, ]   1999-05-28 11:35:45
>   - actual[5, ]  NA
>   + expected[5, ]   1999-03-16 08:08:55
>   - actual[6, ]  NA
>   + expected[6, ]   1999-09-25 00:19:59
>   - actual[7, ]  NA
>   + expected[7, ]   1999-10-12 20:47:55
>   - actual[8, ]  NA
>   + expected[8, ]   1999-04-15 20:36:12
>   - actual[9, ]  NA
>   + expected[9, ]   1999-05-01 03:55:23
>   - actual[10, ] NA
>   + expected[10, ]  1999-12-15 01:19:05
>   and 90 more ...
>   
>actual$x | expected$x   
>[1] NA   - "1999-03-16 12:22:20" [1]
>[2] NA   - "1999-10-08 18:02:24" [2]
>[3] NA   - "1999-04-04 03:52:27" [3]
>[4] NA   - "1999-05-28 11:35:45" [4]
>[5] NA   - "1999-03-16 08:08:55" [5]
>[6] NA   - "1999-09-25 00:19:59" [6]
>[7] NA   - "1999-10-12 20:47:55" [7]
>[8] NA   - "1999-04-15 20:36:12" [8]
>[9] NA   - "1999-05-01 03:55:23" [9]
>   [10] NA   - "1999-12-15 01:19:05" [10]   
>... ......   and 90 more ...
>   Backtrace:
>   ▆
>1. └─arrow:::expect_equal(...) at test-dplyr-funcs-datetime.R:183:4
>2.   └─testthat::expect_equal(...) at 
> tests/testthat/helper-expectation.R:42:4
>   
>   [ FAIL 1 | WARN 0 | SKIP 79 | PASS 8173 ]
> {code}
> It appears that one of the strptime tests returns NA in Arrow but not in R. 
> Reading the test, it uses R to first strftime and then tests that Arrow and R 
> both strptime that back, so it could be an R quirk: R recognizes and can do 
> something with this strptime token round trip, but our library doesn't. 
> Unfortunately, I don't know which token it is though because these tests are 
> run in a for loop and the failure message doesn't say which token is the one 
> that is failing. testthat does provide some facilities for reporting useful 
> things within a loop, so we should wire those up. 
> In addition to better handling of tests in a loop, we should probably just 
> skip this whole thing on CRAN.





[jira] [Assigned] (ARROW-17386) [R] strptime tests not robust across platforms

2022-08-11 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc reassigned ARROW-17386:
--

Assignee: Rok Mihevc

> [R] strptime tests not robust across platforms
> --
>
> Key: ARROW-17386
> URL: https://issues.apache.org/jira/browse/ARROW-17386
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Rok Mihevc
>Priority: Major
> Fix For: 9.0.1
>
>
> After the 9.0.0 release was accepted on CRAN, Ripley emailed me about a test 
> failure on some other machine, which has not yet shown up on CRAN checks:
> {code}
> ── Failure (test-dplyr-funcs-datetime.R:183:5): strptime 
> ───
>   `object` (`actual`) not equal to `expected` (`expected`).
>   
>   actual vs expected
>   x
>   - actual[1, ]  NA
>   + expected[1, ]   1999-03-16 12:22:20
>   - actual[2, ]  NA
>   + expected[2, ]   1999-10-08 18:02:24
>   - actual[3, ]  NA
>   + expected[3, ]   1999-04-04 03:52:27
>   - actual[4, ]  NA
>   + expected[4, ]   1999-05-28 11:35:45
>   - actual[5, ]  NA
>   + expected[5, ]   1999-03-16 08:08:55
>   - actual[6, ]  NA
>   + expected[6, ]   1999-09-25 00:19:59
>   - actual[7, ]  NA
>   + expected[7, ]   1999-10-12 20:47:55
>   - actual[8, ]  NA
>   + expected[8, ]   1999-04-15 20:36:12
>   - actual[9, ]  NA
>   + expected[9, ]   1999-05-01 03:55:23
>   - actual[10, ] NA
>   + expected[10, ]  1999-12-15 01:19:05
>   and 90 more ...
>   
>actual$x | expected$x   
>[1] NA   - "1999-03-16 12:22:20" [1]
>[2] NA   - "1999-10-08 18:02:24" [2]
>[3] NA   - "1999-04-04 03:52:27" [3]
>[4] NA   - "1999-05-28 11:35:45" [4]
>[5] NA   - "1999-03-16 08:08:55" [5]
>[6] NA   - "1999-09-25 00:19:59" [6]
>[7] NA   - "1999-10-12 20:47:55" [7]
>[8] NA   - "1999-04-15 20:36:12" [8]
>[9] NA   - "1999-05-01 03:55:23" [9]
>   [10] NA   - "1999-12-15 01:19:05" [10]   
>... ......   and 90 more ...
>   Backtrace:
>   ▆
>1. └─arrow:::expect_equal(...) at test-dplyr-funcs-datetime.R:183:4
>2.   └─testthat::expect_equal(...) at 
> tests/testthat/helper-expectation.R:42:4
>   
>   [ FAIL 1 | WARN 0 | SKIP 79 | PASS 8173 ]
> {code}
> It appears that one of the strptime tests returns NA in Arrow but not in R. 
> Reading the test, it uses R to first strftime and then tests that Arrow and R 
> both strptime that back, so it could be an R quirk: R recognizes and can do 
> something with this strptime token round trip, but our library doesn't. 
> Unfortunately, I don't know which token it is though because these tests are 
> run in a for loop and the failure message doesn't say which token is the one 
> that is failing. testthat does provide some facilities for reporting useful 
> things within a loop, so we should wire those up. 
> In addition to better handling of tests in a loop, we should probably just 
> skip this whole thing on CRAN.





[jira] [Updated] (ARROW-17386) [R] strptime tests not robust across platforms

2022-08-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17386:
---
Labels: pull-request-available  (was: )

> [R] strptime tests not robust across platforms
> --
>
> Key: ARROW-17386
> URL: https://issues.apache.org/jira/browse/ARROW-17386
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Rok Mihevc
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After the 9.0.0 release was accepted on CRAN, Ripley emailed me about a test 
> failure on some other machine, which has not yet shown up on CRAN checks:
> {code}
> ── Failure (test-dplyr-funcs-datetime.R:183:5): strptime 
> ───
>   `object` (`actual`) not equal to `expected` (`expected`).
>   
>   actual vs expected
>   x
>   - actual[1, ]  NA
>   + expected[1, ]   1999-03-16 12:22:20
>   - actual[2, ]  NA
>   + expected[2, ]   1999-10-08 18:02:24
>   - actual[3, ]  NA
>   + expected[3, ]   1999-04-04 03:52:27
>   - actual[4, ]  NA
>   + expected[4, ]   1999-05-28 11:35:45
>   - actual[5, ]  NA
>   + expected[5, ]   1999-03-16 08:08:55
>   - actual[6, ]  NA
>   + expected[6, ]   1999-09-25 00:19:59
>   - actual[7, ]  NA
>   + expected[7, ]   1999-10-12 20:47:55
>   - actual[8, ]  NA
>   + expected[8, ]   1999-04-15 20:36:12
>   - actual[9, ]  NA
>   + expected[9, ]   1999-05-01 03:55:23
>   - actual[10, ] NA
>   + expected[10, ]  1999-12-15 01:19:05
>   and 90 more ...
>   
>actual$x | expected$x   
>[1] NA   - "1999-03-16 12:22:20" [1]
>[2] NA   - "1999-10-08 18:02:24" [2]
>[3] NA   - "1999-04-04 03:52:27" [3]
>[4] NA   - "1999-05-28 11:35:45" [4]
>[5] NA   - "1999-03-16 08:08:55" [5]
>[6] NA   - "1999-09-25 00:19:59" [6]
>[7] NA   - "1999-10-12 20:47:55" [7]
>[8] NA   - "1999-04-15 20:36:12" [8]
>[9] NA   - "1999-05-01 03:55:23" [9]
>   [10] NA   - "1999-12-15 01:19:05" [10]   
>... ......   and 90 more ...
>   Backtrace:
>   ▆
>1. └─arrow:::expect_equal(...) at test-dplyr-funcs-datetime.R:183:4
>2.   └─testthat::expect_equal(...) at 
> tests/testthat/helper-expectation.R:42:4
>   
>   [ FAIL 1 | WARN 0 | SKIP 79 | PASS 8173 ]
> {code}
> It appears that one of the strptime tests returns NA in Arrow but not in R. 
> Reading the test, it uses R to first strftime and then tests that Arrow and R 
> both strptime that back, so it could be an R quirk: R recognizes and can do 
> something with this strptime token round trip, but our library doesn't. 
> Unfortunately, I don't know which token it is though because these tests are 
> run in a for loop and the failure message doesn't say which token is the one 
> that is failing. testthat does provide some facilities for reporting useful 
> things within a loop, so we should wire those up. 
> In addition to better handling of tests in a loop, we should probably just 
> skip this whole thing on CRAN.





[jira] [Commented] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND

2022-08-11 Thread Shane Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578526#comment-17578526
 ] 

Shane Brennan commented on ARROW-17374:
---

Thanks Kouhei and Neal for responding. 

The conda install fails both for the release and the nightly build. Seems like 
the Python 3.10 on the system doesn't play too well with the conda install 
package. The conda error is:
{code:java}
(R) sh-4.2$ conda install -c conda-forge --strict-channel-priority r-arrow
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible 
solve.
Solving environment: failed with repodata from current_repodata.json, will 
retry with next repodata source.

ResolvePackageNotFound: 
  - python=3.1{code}
However, I mostly was trying to install arrow within R using this setup

 
{code:java}
Sys.setenv(NOT_CRAN = TRUE)
Sys.setenv("ARROW_PARQUET" = "ON")
Sys.setenv("ARROW_S3" = "ON")
Sys.setenv("ARROW_WITH_SNAPPY" = "ON")
Sys.setenv("ARROW_R_DEV" = TRUE)
install.packages("arrow", repos = 
"https://arrow-r-nightly.s3.amazonaws.com"){code}
 

> [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
> --
>
> Key: ARROW-17374
> URL: https://issues.apache.org/jira/browse/ARROW-17374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0, 9.0.0, 8.0.1
> Environment: Amazon Linux 2 (RHEL) - 5.10.102-99.473.amzn2.x86_64
>Reporter: Shane Brennan
>Priority: Blocker
>
> I've been trying to install Arrow on an R notebook within AWS SageMaker. 
> SageMaker provides Jupyter-like notebooks, with each instance running Amazon 
> Linux 2 as its OS, itself based on RHEL. 
> Trying to install a few ways, e.g., using the standard binaries, using the 
> nightly builds, setting ARROW_WITH_SNAPPY to ON and LIBARROW_MINIMAL all 
> still result in the following error. 
> {noformat}
> x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared 
> -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -Wl,-O2 -Wl,--sort-common 
> -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags 
> -Wl,--gc-sections -Wl,--allow-shlib-undefined 
> -Wl,-rpath,/home/ec2-user/anaconda3/envs/R/lib 
> -Wl,-rpath-link,/home/ec2-user/anaconda3/envs/R/lib 
> -L/home/ec2-user/anaconda3/envs/R/lib -o arrow.so RTasks.o altrep.o array.o 
> array_to_vector.o arraydata.o arrowExports.o bridge.o buffer.o chunkedarray.o 
> compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o 
> expression.o extension-impl.o feather.o field.o filesystem.o imports.o io.o 
> json.o memorypool.o message.o parquet.o r_to_arrow.o recordbatch.o 
> recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o 
> schema.o symbols.o table.o threadpool.o type_infer.o 
> -L/tmp/Rtmpuh87oc/R.INSTALL67114493a3de/arrow/libarrow/arrow-9.0.0.20220809/lib
>  -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies -lz 
> SNAPPY_LIB-NOTFOUND /home/ec2-user/anaconda3/envs/R/lib/libbz2.so -pthread 
> -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto 
> -lcurl -lssl -lcrypto -lcurl -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -lR
> x86_64-conda-linux-gnu-c++: error: SNAPPY_LIB-NOTFOUND: No such file or 
> directory
> make: *** [/home/ec2-user/anaconda3/envs/R/lib/R/share/make/shlib.mk:10: 
> arrow.so] Error 1{noformat}
> Snappy is installed on the systems, and both shared object (.so) and cmake 
> files are there, where I've tried setting the system env variables Snappy_DIR 
> and Snappy_LIB to point at them, but to no avail.





[jira] [Comment Edited] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND

2022-08-11 Thread Shane Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578526#comment-17578526
 ] 

Shane Brennan edited comment on ARROW-17374 at 8/11/22 3:16 PM:


Thanks Kouhei and Neal for responding. 

The conda install fails both for the release and the nightly build. Seems like 
the Python 3.10 on the system doesn't play too well with the conda install 
package. The conda error is:
{code:java}
(R) sh-4.2$ conda install -c conda-forge --strict-channel-priority r-arrow
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible 
solve.
Solving environment: failed with repodata from current_repodata.json, will 
retry with next repodata source.

ResolvePackageNotFound: 
  - python=3.1{code}
However, I mostly was trying to install arrow within R using this setup
{code:java}
Sys.setenv(NOT_CRAN = TRUE)
Sys.setenv("ARROW_PARQUET" = "ON")
Sys.setenv("ARROW_S3" = "ON")
Sys.setenv("ARROW_WITH_SNAPPY" = "ON")
Sys.setenv("ARROW_R_DEV" = TRUE)
install.packages("arrow", repos = 
"https://arrow-r-nightly.s3.amazonaws.com"){code}


was (Author: JIRAUSER294234):
Thanks Kouhei and Neal for responding. 

The conda install fails for both the release and the nightly build. It seems 
the Python 3.10 on the system doesn't play well with the conda package. The 
conda error is:
{code:java}
(R) sh-4.2$ conda install -c conda-forge --strict-channel-priority r-arrow
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible 
solve.
Solving environment: failed with repodata from current_repodata.json, will 
retry with next repodata source.

ResolvePackageNotFound: 
  - python=3.1{code}
However, I was mostly trying to install arrow from within R using this setup

 
{code:java}
Sys.setenv(NOT_CRAN = TRUE)
Sys.setenv("ARROW_PARQUET" = "ON")
Sys.setenv("ARROW_S3" = "ON")
Sys.setenv("ARROW_WITH_SNAPPY" = "ON")
Sys.setenv("ARROW_R_DEV" = TRUE)
install.packages("arrow", repos = 
"https://arrow-r-nightly.s3.amazonaws.com"){code}
 

> [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
> --
>
> Key: ARROW-17374
> URL: https://issues.apache.org/jira/browse/ARROW-17374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0, 9.0.0, 8.0.1
> Environment: Amazon Linux 2 (RHEL) - 5.10.102-99.473.amzn2.x86_64
>Reporter: Shane Brennan
>Priority: Blocker
>
> I've been trying to install Arrow on an R notebook within AWS SageMaker. 
> SageMaker provides Jupyter-like notebooks, with each instance running Amazon 
> Linux 2 as its OS, itself based on RHEL. 
> Trying to install a few ways, e.g., using the standard binaries, using the 
> nightly builds, setting ARROW_WITH_SNAPPY to ON and LIBARROW_MINIMAL all 
> still result in the following error. 
> {noformat}
> x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared 
> -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -Wl,-O2 -Wl,--sort-common 
> -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags 
> -Wl,--gc-sections -Wl,--allow-shlib-undefined 
> -Wl,-rpath,/home/ec2-user/anaconda3/envs/R/lib 
> -Wl,-rpath-link,/home/ec2-user/anaconda3/envs/R/lib 
> -L/home/ec2-user/anaconda3/envs/R/lib -o arrow.so RTasks.o altrep.o array.o 
> array_to_vector.o arraydata.o arrowExports.o bridge.o buffer.o chunkedarray.o 
> compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o 
> expression.o extension-impl.o feather.o field.o filesystem.o imports.o io.o 
> json.o memorypool.o message.o parquet.o r_to_arrow.o recordbatch.o 
> recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o 
> schema.o symbols.o table.o threadpool.o type_infer.o 
> -L/tmp/Rtmpuh87oc/R.INSTALL67114493a3de/arrow/libarrow/arrow-9.0.0.20220809/lib
>  -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies -lz 
> SNAPPY_LIB-NOTFOUND /home/ec2-user/anaconda3/envs/R/lib/libbz2.so -pthread 
> -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto 
> -lcurl -lssl -lcrypto -lcurl -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -lR
> x86_64-conda-linux-gnu-c++: error: SNAPPY_LIB-NOTFOUND: No such file or 
> directory
> make: *** [/home/ec2-user/anaconda3/envs/R/lib/R/share/make/shlib.mk:10: 
> arrow.so] Error 1{noformat}
> Snappy is installed on the system, and both the shared object (.so) and cmake 
> files are present. I've tried setting the environment variables Snappy_DIR 
> and Snappy_LIB to point at them, but to no avail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14999) [C++] List types with different field names are not equal

2022-08-11 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578528#comment-17578528
 ] 

Will Jones commented on ARROW-14999:


Do you expect to be able to roundtrip that from Parquet? It seems like the 
conclusion of the discussion in ARROW-11497 was that we should transition in 
the long term towards always using "element", but maybe we could still 
roundtrip by casting back based on the Arrow schema saved in the metadata?

> [C++] List types with different field names are not equal
> -
>
> Key: ARROW-14999
> URL: https://issues.apache.org/jira/browse/ARROW-14999
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 6.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When comparing map types, the names of the fields are ignored. This was 
> introduced in ARROW-7173.
> However for list types, they are not ignored. For example,
> {code:python}
> In [6]: l1 = pa.list_(pa.field("val", pa.int64()))
> In [7]: l2 = pa.list_(pa.int64())
> In [8]: l1
> Out[8]: ListType(list<val: int64>)
> In [9]: l2
> Out[9]: ListType(list<item: int64>)
> In [10]: l1 == l2
> Out[10]: False
> {code}
> Should we make list type comparison ignore field names too?
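The proposal can be sketched in plain Python with a hypothetical toy type (the class and method names are illustrative, not the real pyarrow/C++ API): comparison checks the value type and only consults field names on request, mirroring how map types already compare.

```python
# Hypothetical sketch of name-insensitive list-type equality; not the
# actual Arrow implementation.
class ToyListType:
    def __init__(self, field_name, value_type):
        self.field_name = field_name
        self.value_type = value_type

    def equals(self, other, check_field_names=False):
        # Field names don't affect the storage interpretation, so by
        # default they are ignored.
        if self.value_type != other.value_type:
            return False
        return not check_field_names or self.field_name == other.field_name

l1 = ToyListType("val", "int64")
l2 = ToyListType("item", "int64")
print(l1.equals(l2))                          # True: names ignored
print(l1.equals(l2, check_field_names=True))  # False: names differ
```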



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17387) [R] Implement dplyr::across() inside filter()

2022-08-11 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17387:


 Summary: [R] Implement dplyr::across() inside filter()
 Key: ARROW-17387
 URL: https://issues.apache.org/jira/browse/ARROW-17387
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane
 Fix For: 10.0.0


ARROW-11699 adds the ability to call dplyr::across() inside dplyr::mutate().  
Once this is merged, we should also add the ability to do so within 
dplyr::summarise().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17387) [R] Implement dplyr::across() inside filter()

2022-08-11 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17387:
-
Description: ARROW-11699 adds the ability to call dplyr::across() inside 
dplyr::mutate().  Once this is merged, we should also add the ability to do so 
within dplyr::filter().  (was: ARROW-11699 adds the ability to call 
dplyr::across() inside dplyr::mutate().  Once this is merged, we should also 
add the ability to do so within dplyr::summarise().)

> [R] Implement dplyr::across() inside filter()
> -
>
> Key: ARROW-17387
> URL: https://issues.apache.org/jira/browse/ARROW-17387
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: dplyr
> Fix For: 10.0.0
>
>
> ARROW-11699 adds the ability to call dplyr::across() inside dplyr::mutate().  
> Once this is merged, we should also add the ability to do so within 
> dplyr::filter().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17388) Prevent corrupting files with Multiple matches for FieldRef.Name

2022-08-11 Thread Grayden Shand (Jira)
Grayden Shand created ARROW-17388:
-

 Summary: Prevent corrupting files with Multiple matches for 
FieldRef.Name
 Key: ARROW-17388
 URL: https://issues.apache.org/jira/browse/ARROW-17388
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
 Environment: MacOS, Python 3.10.3
Reporter: Grayden Shand


{*}Version{*}: pyarrow 9.0.0

 

*Description*

Users can add a column with the same name as an existing column to a table 
via `pyarrow.Table.add_column()`.

 

Additionally, that table can be written to a parquet file with 
`pyarrow.parquet.write_table()`.

 

However, the written file cannot be read with `pyarrow.parquet.read_table()` 
due to having multiple columns with the same name.

 

Flagging this as a bug because I believe anything that is successfully written 
by `write_table()` should be readable by `read_table()`.

 

*Minimum reproducible example*

```

>>> import pyarrow.parquet as pq
>>> import pyarrow as pa
>>> t = pa.Table.from_pydict({'a': [1,2,3]})
>>> pq.write_table(t.add_column(0, 'a', pa.array([1.1,2.2,3.3])), 
>>> 'test.parquet')
>>> pq.read_table('test.parquet')
pyarrow.lib.ArrowInvalid: Multiple matches for FieldRef.Name(a) in a: double
a: int64
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

```
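Until read_table can handle duplicates (or write_table rejects them), one possible workaround is to deduplicate column names before writing, e.g. via `Table.rename_columns` (assumed pyarrow API); the renaming logic itself is plain Python:

```python
def dedupe_column_names(names):
    # Append a numeric suffix to repeated names so every column is unique;
    # the resulting list could be passed to Table.rename_columns (assumed
    # pyarrow API) before pq.write_table.
    seen = {}
    out = []
    for name in names:
        count = seen.get(name, 0)
        seen[name] = count + 1
        out.append(name if count == 0 else f"{name}_{count}")
    return out

print(dedupe_column_names(["a", "a", "b"]))  # ['a', 'a_1', 'b']
```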



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14045) [R] Support for .keep_all = TRUE with distinct()

2022-08-11 Thread Kazuyuki Ura (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578538#comment-17578538
 ] 

Kazuyuki Ura commented on ARROW-14045:
--

The `hash_one` function in C++ does NOT return null values, so the result will 
be different if the first row in a group contains a null value.
See [link Compute Function|https://arrow.apache.org/docs/cpp/compute.html]
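The behavioural difference can be sketched in plain Python (a toy model, not the actual C++ kernel): dplyr's `distinct(.keep_all = TRUE)` keeps the first row of each group even when it is null, while a hash_one-style "pick one" that skips nulls can return a different value.

```python
def first(values):
    # dplyr-style: keep the first row's value, even if it is null (None)
    return values[0] if values else None

def one_non_null(values):
    # toy model of a hash_one-style aggregate that skips nulls
    for v in values:
        if v is not None:
            return v
    return None

group = [None, 2, 3]
print(first(group))         # None
print(one_non_null(group))  # 2
```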

> [R] Support for .keep_all = TRUE with distinct() 
> -
>
> Key: ARROW-14045
> URL: https://issues.apache.org/jira/browse/ARROW-14045
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Kazuyuki Ura
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-14045) [R] Support for .keep_all = TRUE with distinct()

2022-08-11 Thread Kazuyuki Ura (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578538#comment-17578538
 ] 

Kazuyuki Ura edited comment on ARROW-14045 at 8/11/22 3:41 PM:
---

The `hash_one` function in C++ does NOT return null values, so the result will 
be different if the first row in a group contains a null value.
See [Compute Function|https://arrow.apache.org/docs/cpp/compute.html]


was (Author: JIRAUSER293161):
The `hash_one` function in C++ does NOT return null values, so the result will 
be different if the first row in a group contains a null value.
See [link Compute Function|https://arrow.apache.org/docs/cpp/compute.html]

> [R] Support for .keep_all = TRUE with distinct() 
> -
>
> Key: ARROW-14045
> URL: https://issues.apache.org/jira/browse/ARROW-14045
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Kazuyuki Ura
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14045) [R] Support for .keep_all = TRUE with distinct()

2022-08-11 Thread Kazuyuki Ura (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Ura reassigned ARROW-14045:


Assignee: (was: Kazuyuki Ura)

> [R] Support for .keep_all = TRUE with distinct() 
> -
>
> Key: ARROW-14045
> URL: https://issues.apache.org/jira/browse/ARROW-14045
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14045) [R] Support for .keep_all = TRUE with distinct()

2022-08-11 Thread Kazuyuki Ura (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578541#comment-17578541
 ] 

Kazuyuki Ura commented on ARROW-14045:
--

and I close my pull request... I will think about how to implement .keep_all = 
TRUE again.

> [R] Support for .keep_all = TRUE with distinct() 
> -
>
> Key: ARROW-14045
> URL: https://issues.apache.org/jira/browse/ARROW-14045
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-14045) [R] Support for .keep_all = TRUE with distinct()

2022-08-11 Thread Kazuyuki Ura (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578541#comment-17578541
 ] 

Kazuyuki Ura edited comment on ARROW-14045 at 8/11/22 3:44 PM:
---

and I closed my pull request... I will think about how to implement .keep_all = 
TRUE again.


was (Author: JIRAUSER293161):
and I close my pull request... I will think about how to implement .keep_all = 
TRUE again.

> [R] Support for .keep_all = TRUE with distinct() 
> -
>
> Key: ARROW-14045
> URL: https://issues.apache.org/jira/browse/ARROW-14045
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14045) [R] Support for .keep_all = TRUE with distinct()

2022-08-11 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578560#comment-17578560
 ] 

Neal Richardson commented on ARROW-14045:
-

> The `hash_one` function in C++ does NOT return null values, so the result 
> will be different if the first row in a group contains a null value.

Should that be added as a FunctionOption in C++?

> [R] Support for .keep_all = TRUE with distinct() 
> -
>
> Key: ARROW-14045
> URL: https://issues.apache.org/jira/browse/ARROW-14045
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17389) In pyarrow, conftest is installed even when tests are not

2022-08-11 Thread Benjamin Beasley (Jira)
Benjamin Beasley created ARROW-17389:


 Summary: In pyarrow, conftest is installed even when tests are not
 Key: ARROW-17389
 URL: https://issues.apache.org/jira/browse/ARROW-17389
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Benjamin Beasley


When PYARROW_INSTALL_TESTS is set to 0, the packages keyword to 
setuptools.setup is:

{{find_namespace_packages(include=['pyarrow*'], exclude=["pyarrow.tests*"])}}

Since pyarrow.conftest is associated with the tests, I think it should be added 
to the list of modules to exclude in this case, i.e.,

{{find_namespace_packages(include=['pyarrow*'], exclude=["pyarrow.tests*", 
"pyarrow.conftest"])}}

(As a separate issue, it doesn’t seem like any of these excludes is having any 
effect for me anyway, as I find the tests are still included in wheels no 
matter what.)
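The proposed change amounts to extending the exclude list when tests are not installed; a minimal sketch of that conditional (the boolean mirrors the PYARROW_INSTALL_TESTS environment variable, and the patterns follow setuptools' `find_namespace_packages(exclude=...)` syntax):

```python
def package_excludes(install_tests: bool):
    # Patterns for find_namespace_packages(..., exclude=...); when tests
    # are not installed, pyarrow.conftest is excluded along with them.
    if install_tests:
        return []
    return ["pyarrow.tests*", "pyarrow.conftest"]

print(package_excludes(False))  # ['pyarrow.tests*', 'pyarrow.conftest']
print(package_excludes(True))   # []
```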



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-9773) [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array

2022-08-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9773:
--
Labels: kernel pull-request-available  (was: kernel)

> [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array
> ---
>
> Key: ARROW-9773
> URL: https://issues.apache.org/jira/browse/ARROW-9773
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: David Li
>Assignee: Will Jones
>Priority: Major
>  Labels: kernel, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Take() currently concatenates ChunkedArrays first. However, this breaks down 
> when calling Take() from a ChunkedArray or Table where concatenating the 
> arrays would result in an array that's too large. While inconvenient to 
> implement, it would be useful if this case were handled.
> This could be done as a higher-level wrapper around Take(), perhaps.
> Example in Python:
> {code:python}
> >>> import pyarrow as pa
> >>> pa.__version__
> '1.0.0'
> >>> rb1 = pa.RecordBatch.from_arrays([["a" * 2**30]], names=["a"])
> >>> rb2 = pa.RecordBatch.from_arrays([["b" * 2**30]], names=["a"])
> >>> table = pa.Table.from_batches([rb1, rb2], schema=rb1.schema)
> >>> table.take([1, 0])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 1145, in pyarrow.lib.Table.take
>   File 
> "/home/lidavidm/Code/twosigma/arrow/venv/lib/python3.8/site-packages/pyarrow/compute.py",
>  line 268, in take
> return call_function('take', [data, indices], options)
>   File "pyarrow/_compute.pyx", line 298, in pyarrow._compute.call_function
>   File "pyarrow/_compute.pyx", line 192, in pyarrow._compute.Function.call
>   File "pyarrow/error.pxi", line 122, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
> {code}
> In this example, it would be useful if Take() or a higher-level wrapper could 
> generate multiple record batches as output.
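A higher-level wrapper could avoid concatenation entirely by first resolving each global take index to a (chunk, local index) pair; a plain-Python sketch of that resolution step (illustrative only, not the Arrow C++ implementation):

```python
from bisect import bisect_right

def resolve_indices(chunk_lengths, indices):
    # Build cumulative chunk offsets, e.g. [2, 3] -> [0, 2, 5], then map
    # each global index to the chunk owning it and its local position.
    offsets = [0]
    for n in chunk_lengths:
        offsets.append(offsets[-1] + n)
    resolved = []
    for i in indices:
        chunk = bisect_right(offsets, i) - 1   # owning chunk
        resolved.append((chunk, i - offsets[chunk]))
    return resolved

print(resolve_indices([2, 3], [1, 0, 4]))  # [(0, 1), (0, 0), (1, 2)]
```

With the indices resolved per chunk, each chunk can be taken from individually and the results emitted as multiple batches, never materializing one oversized array.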



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17388) Prevent corrupting files with Multiple matches for FieldRef.Name

2022-08-11 Thread Grayden Shand (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grayden Shand updated ARROW-17388:
--
Priority: Major  (was: Minor)

> Prevent corrupting files with Multiple matches for FieldRef.Name
> 
>
> Key: ARROW-17388
> URL: https://issues.apache.org/jira/browse/ARROW-17388
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: MacOS, Python 3.10.3
>Reporter: Grayden Shand
>Priority: Major
>
> {*}Version{*}: pyarrow 9.0.0
>  
> *Description*
> Users can add a column with the same name as an existing column to a 
> table via `pyarrow.Table.add_column()`.
>  
> Additionally, that table can be written to a parquet file with 
> `pyarrow.parquet.write_table()`.
>  
> However, the written file cannot be read with `pyarrow.parquet.read_table()` 
> due to having multiple columns with the same name.
>  
> Flagging this as a bug because I believe anything that is successfully 
> written by `write_table()` should be readable by `read_table()`.
>  
> *Minimum reproducible example*
> ```
> >>> import pyarrow.parquet as pq
> >>> import pyarrow as pa
> >>> t = pa.Table.from_pydict({'a': [1,2,3]})
> >>> pq.write_table(t.add_column(0, 'a', pa.array([1.1,2.2,3.3])), 
> >>> 'test.parquet')
> >>> pq.read_table('test.parquet')
> pyarrow.lib.ArrowInvalid: Multiple matches for FieldRef.Name(a) in a: double
> a: int64
> __fragment_index: int32
> __batch_index: int32
> __last_in_fragment: bool
> __filename: string
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers

2022-08-11 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578601#comment-17578601
 ] 

Joris Van den Bossche commented on ARROW-10739:
---

That sounds good!

> [Python] Pickling a sliced array serializes all the buffers
> ---
>
> Key: ARROW-10739
> URL: https://issues.apache.org/jira/browse/ARROW-10739
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Maarten Breddels
>Assignee: Alessandro Molina
>Priority: Critical
> Fix For: 10.0.0
>
>
> If a large array is sliced, and pickled, it seems the full buffer is 
> serialized, this leads to excessive memory usage and data transfer when using 
> multiprocessing or dask.
> {code:java}
> >>> import pyarrow as pa
> >>> ar = pa.array(['foo'] * 100_000)
> >>> ar.nbytes
> 74
> >>> import pickle
> >>> len(pickle.dumps(ar.slice(10, 1)))
> 700165
> NumPy for instance
> >>> import numpy as np
> >>> ar_np = np.array(ar)
> >>> ar_np
> array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
> >>> import pickle
> >>> len(pickle.dumps(ar_np[10:11]))
> 165{code}
> I think this makes sense if you know arrow, but kind of unexpected as a user.
> Is there a workaround for this? For instance copy an arrow array to get rid 
> of the offset, and trim the buffers?
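The suggested workaround (copy the slice so the parent buffer is dropped before serializing) can be illustrated in plain Python with memoryview, independent of pyarrow:

```python
import pickle

big = b"x" * 1_000_000
view = memoryview(big)[10:13]   # a slice still referencing the parent buffer

# Copying first (bytes(view)) means pickle only carries the three sliced
# bytes, not the full megabyte backing them.
compact = pickle.dumps(bytes(view))
full = pickle.dumps(big)
print(len(compact) < 100, len(full) > 1_000_000)  # True True
```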



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17167) [C++][Docs] Revise C++ Documentation

2022-08-11 Thread Kae Suarez (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kae Suarez reassigned ARROW-17167:
--

Assignee: Kae Suarez

> [C++][Docs] Revise C++ Documentation
> 
>
> Key: ARROW-17167
> URL: https://issues.apache.org/jira/browse/ARROW-17167
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Kae Suarez
>Assignee: Kae Suarez
>Priority: Major
>  Labels: documentation, newbie
>
> Parent ticket for tasks that aim to revise the C++ Arrow documentation.
>  
> The goal is to review what makes good documentation, and to reformat/create 
> pages as necessary to improve the C++ docs as a whole. Subtasks will tackle 
> this iteratively; refer to those for more information. Suggestions from new 
> users would be incredibly valued, since their experience is more likely to 
> have been impacted by any lacking documentation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17377) [C++][Docs] Create tutorial content for basic Arrow, file access, compute, and datasets

2022-08-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17377:
---
Labels: pull-request-available  (was: )

> [C++][Docs] Create tutorial content for basic Arrow, file access, compute, 
> and datasets
> ---
>
> Key: ARROW-17377
> URL: https://issues.apache.org/jira/browse/ARROW-17377
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Documentation
>Reporter: Kae Suarez
>Assignee: Kae Suarez
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As per 
> [https://docs.google.com/document/d/1IFk6m97JWZZzFC3UIlLf3sxnXgFoL-l89nqSGl8bE28/edit?usp=sharing],
>  create a set of articles dedicated to introducing users to basic Arrow data 
> structures (i.e., arrays, recordbatches, tables), file access, compute 
> functions, and Arrow Datasets. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14999) [C++] List types with different field names are not equal

2022-08-11 Thread Dewey Dunnington (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578644#comment-17578644
 ] 

Dewey Dunnington commented on ARROW-14999:
--

Yes, we do use those names to communicate some information in the geoarrow spec 
(although I don't think we need them for type equality since it doesn't affect 
the storage interpretation).

> [C++] List types with different field names are not equal
> -
>
> Key: ARROW-14999
> URL: https://issues.apache.org/jira/browse/ARROW-14999
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 6.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When comparing map types, the names of the fields are ignored. This was 
> introduced in ARROW-7173.
> However for list types, they are not ignored. For example,
> {code:python}
> In [6]: l1 = pa.list_(pa.field("val", pa.int64()))
> In [7]: l2 = pa.list_(pa.int64())
> In [8]: l1
> Out[8]: ListType(list<val: int64>)
> In [9]: l2
> Out[9]: ListType(list<item: int64>)
> In [10]: l1 == l2
> Out[10]: False
> {code}
> Should we make list type comparison ignore field names too?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17390) [Go] Add Union Scalar Types

2022-08-11 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-17390:
-

 Summary: [Go] Add Union Scalar Types
 Key: ARROW-17390
 URL: https://issues.apache.org/jira/browse/ARROW-17390
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Go
Reporter: Matthew Topol
Assignee: Matthew Topol






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17390) [Go] Add Union Scalar Types

2022-08-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17390:
---
Labels: pull-request-available  (was: )

> [Go] Add Union Scalar Types
> ---
>
> Key: ARROW-17390
> URL: https://issues.apache.org/jira/browse/ARROW-17390
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17372) Arrow parquet go is missing Power (ppc64le) specific utils implementations

2022-08-11 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-17372.
---
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13840
[https://github.com/apache/arrow/pull/13840]

> Arrow parquet go is missing Power (ppc64le) specific utils implementations
> --
>
> Key: ARROW-17372
> URL: https://issues.apache.org/jira/browse/ARROW-17372
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go, Parquet
>Affects Versions: 8.0.0, 8.0.1
> Environment: Linux (RHEL) ppc64le
>Reporter: Marvin Giessing
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When trying to build the [feast go 
> lib|https://github.com/feast-dev/feast/tree/master/go] on ppc64le, it fails 
> because parquet's internal utils package is missing even the basic pure-Go 
> implementation files for ppc64le. Providing e.g. `bit_packing_ppc64le.go` & 
> `unpack_bool_ppc64le.go` should solve this issue in the first place.
> It can then be enhanced by implementing the correct vector & matrix 
> intrinsics for the Power architecture (e.g. VSX or MMA) in a second step.
> {code:java}
> go build -mod=mod -buildmode=c-shared -tags cgo,ccalloc -o embedded_go.so .
> cmd had error: exit status 2 output:
> github.com/apache/arrow/go/v8/parquet/internal/utils
> /root/go/pkg/mod/github.com/apache/arrow/go/v8@v8.0.0/parquet/internal/utils/bit_reader.go:230:18:
>  undefined: unpack32
> /root/go/pkg/mod/github.com/apache/arrow/go/v8@v8.0.0/parquet/internal/utils/bit_reader.go:274:3:
>  undefined: BytesToBools
> /root/go/pkg/mod/github.com/apache/arrow/go/v8@v8.0.0/parquet/internal/utils/bit_reader.go:318:18:
>  undefined: unpack32 {code}
> I have already tested this locally (with success) and created a corresponding PR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17326) [Go][FlightSQL] Add Support for FlightSQL to Go

2022-08-11 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-17326.
---
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13828
[https://github.com/apache/arrow/pull/13828]

> [Go][FlightSQL] Add Support for FlightSQL to Go
> ---
>
> Key: ARROW-17326
> URL: https://issues.apache.org/jira/browse/ARROW-17326
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: FlightRPC, Go, SQL
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Also addresses https://github.com/apache/arrow/issues/12496



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17389) In pyarrow, conftest is installed even when tests are not

2022-08-11 Thread Benjamin Beasley (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578657#comment-17578657
 ] 

Benjamin Beasley commented on ARROW-17389:
--

The attached patches appear to solve both aspects of this issue for me.

> In pyarrow, conftest is installed even when tests are not
> -
>
> Key: ARROW-17389
> URL: https://issues.apache.org/jira/browse/ARROW-17389
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Benjamin Beasley
>Priority: Trivial
> Attachments: 
> 0001-Exclude-pyarrow.conftest-when-not-installing-tests.patch, 
> 0002-Exclude-package-data-from-tests-when-not-installing-.patch
>
>
> When PYARROW_INSTALL_TESTS is set to 0, the packages keyword to 
> setuptools.setup is:
> {{find_namespace_packages(include=['pyarrow*'], exclude=["pyarrow.tests*"])}}
> Since pyarrow.conftest is associated with the tests, I think it should be 
> added to the list of modules to exclude in this case, i.e.,
> {{find_namespace_packages(include=['pyarrow*'], exclude=["pyarrow.tests*", 
> "pyarrow.conftest"])}}
> (As a separate issue, it doesn’t seem like any of these excludes is having 
> any effect for me anyway, as I find the tests are still included in wheels no 
> matter what.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17389) In pyarrow, conftest is installed even when tests are not

2022-08-11 Thread Benjamin Beasley (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Beasley updated ARROW-17389:
-
Attachment: 0001-Exclude-pyarrow.conftest-when-not-installing-tests.patch
0002-Exclude-package-data-from-tests-when-not-installing-.patch

> In pyarrow, conftest is installed even when tests are not
> -
>
> Key: ARROW-17389
> URL: https://issues.apache.org/jira/browse/ARROW-17389
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Benjamin Beasley
>Priority: Trivial
> Attachments: 
> 0001-Exclude-pyarrow.conftest-when-not-installing-tests.patch, 
> 0002-Exclude-package-data-from-tests-when-not-installing-.patch
>
>
> When PYARROW_INSTALL_TESTS is set to 0, the packages keyword to 
> setuptools.setup is:
> {{find_namespace_packages(include=['pyarrow*'], exclude=["pyarrow.tests*"])}}
> Since pyarrow.conftest is associated with the tests, I think it should be 
> added to the list of modules to exclude in this case, i.e.,
> {{find_namespace_packages(include=['pyarrow*'], exclude=["pyarrow.tests*", 
> "pyarrow.conftest"])}}
> (As a separate issue, it doesn’t seem like any of these excludes is having 
> any effect for me anyway, as I find the tests are still included in wheels no 
> matter what.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17389) In pyarrow, conftest is installed even when tests are not

2022-08-11 Thread Benjamin Beasley (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Beasley updated ARROW-17389:
-
Attachment: (was: 
0001-Exclude-pyarrow.conftest-when-not-installing-tests.patch)

> In pyarrow, conftest is installed even when tests are not
> -
>
> Key: ARROW-17389
> URL: https://issues.apache.org/jira/browse/ARROW-17389
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Benjamin Beasley
>Priority: Trivial
>
> When PYARROW_INSTALL_TESTS is set to 0, the packages keyword to 
> setuptools.setup is:
> {{find_namespace_packages(include=['pyarrow*'], exclude=["pyarrow.tests*"])}}
> Since pyarrow.conftest is associated with the tests, I think it should be 
> added to the list of modules to exclude in this case, i.e.,
> {{find_namespace_packages(include=['pyarrow*'], exclude=["pyarrow.tests*", 
> "pyarrow.conftest"])}}
> (As a separate issue, it doesn’t seem like any of these excludes is having 
> any effect for me anyway, as I find the tests are still included in wheels no 
> matter what.)





[jira] [Updated] (ARROW-17389) In pyarrow, conftest is installed even when tests are not

2022-08-11 Thread Benjamin Beasley (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Beasley updated ARROW-17389:
-
Attachment: (was: 
0002-Exclude-package-data-from-tests-when-not-installing-.patch)

> In pyarrow, conftest is installed even when tests are not
> -
>
> Key: ARROW-17389
> URL: https://issues.apache.org/jira/browse/ARROW-17389
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Benjamin Beasley
>Priority: Trivial
>
> When PYARROW_INSTALL_TESTS is set to 0, the packages keyword to 
> setuptools.setup is:
> {{find_namespace_packages(include=['pyarrow*'], exclude=["pyarrow.tests*"])}}
> Since pyarrow.conftest is associated with the tests, I think it should be 
> added to the list of modules to exclude in this case, i.e.,
> {{find_namespace_packages(include=['pyarrow*'], exclude=["pyarrow.tests*", 
> "pyarrow.conftest"])}}
> (As a separate issue, it doesn’t seem like any of these excludes is having 
> any effect for me anyway, as I find the tests are still included in wheels no 
> matter what.)





[jira] (ARROW-17389) In pyarrow, conftest is installed even when tests are not

2022-08-11 Thread Benjamin Beasley (Jira)


[ https://issues.apache.org/jira/browse/ARROW-17389 ]


Benjamin Beasley deleted comment on ARROW-17389:
--

was (Author: musicinmybrain):
The attached patches appear to solve both aspects of this issue for me.

> In pyarrow, conftest is installed even when tests are not
> -
>
> Key: ARROW-17389
> URL: https://issues.apache.org/jira/browse/ARROW-17389
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Benjamin Beasley
>Priority: Trivial
>
> When PYARROW_INSTALL_TESTS is set to 0, the packages keyword to 
> setuptools.setup is:
> {{find_namespace_packages(include=['pyarrow*'], exclude=["pyarrow.tests*"])}}
> Since pyarrow.conftest is associated with the tests, I think it should be 
> added to the list of modules to exclude in this case, i.e.,
> {{find_namespace_packages(include=['pyarrow*'], exclude=["pyarrow.tests*", 
> "pyarrow.conftest"])}}
> (As a separate issue, it doesn’t seem like any of these excludes is having 
> any effect for me anyway, as I find the tests are still included in wheels no 
> matter what.)





[jira] [Resolved] (ARROW-17390) [Go] Add Union Scalar Types

2022-08-11 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-17390.
---
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13860
[https://github.com/apache/arrow/pull/13860]

> [Go] Add Union Scalar Types
> ---
>
> Key: ARROW-17390
> URL: https://issues.apache.org/jira/browse/ARROW-17390
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-17385) [Integration] Re-enable disabled Rust Flight middleware test

2022-08-11 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-17385.
--
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13858
[https://github.com/apache/arrow/pull/13858]

> [Integration] Re-enable disabled Rust Flight middleware test
> 
>
> Key: ARROW-17385
> URL: https://issues.apache.org/jira/browse/ARROW-17385
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> Follow-up for ARROW-10961. The linked Rust issue was fixed, so we should 
> re-enable the integration test case.





[jira] [Created] (ARROW-17391) arrow::read_feather() cannot read DictionaryArray written from C#

2022-08-11 Thread Todd West (Jira)
Todd West created ARROW-17391:
-

 Summary: arrow::read_feather() cannot read DictionaryArray written 
from C#
 Key: ARROW-17391
 URL: https://issues.apache.org/jira/browse/ARROW-17391
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#, R
Affects Versions: 9.0.1
Reporter: Todd West
 Fix For: 9.0.1


This applies to Arrow 9.0.0, both the C# nuget and R package, but for some 
reason 9.0.0 isn't in the issue dropdowns' list of released versions. It also 
appears the [implementation status 
page|https://arrow.apache.org/docs/status.html#ipc-format] may be stale as the 
C#  source contains 
[DictionaryArray|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/DictionaryArray.cs]
 and a look in the debugger confirms the flags flip and the data structures 
update for 
[ArrowStreamWriter|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs]
 having correctly received both the dictionary index and value arrays it's 
given on the code paths which write a [dictionary 
batch|https://arrow.apache.org/docs/format/Columnar.html] . However, on the R 
side, read_feather() fails with

{{Error: Key error: Dictionary with id 1 not found}}

So it appears most likely either C# isn't properly emitting the dictionary 
batch, despite seeming to have all the code to do so, or something's going 
wrong in the C++ layers under R in the reading side.

Setup on the C# side is simple

{{        public static DictionaryArray CreateStringTable(Memory 
indicies, IList values)}}
{{        {}}
{{            StringArray.Builder valueArray = new();}}
{{            for (int valueIndex = 0; valueIndex < values.Count; 
++valueIndex)}}
{{            {}}
{{                valueArray.Append(values[valueIndex]);}}
{{            }}}
{{            UInt8Array indexArray = 
new(ArrowArrayExtensions.WrapInArrayData(UInt8Type.Default, indicies, 
indicies.Length));}}
{{            return new DictionaryArray(new(UInt8Type.Default, 
StringType.Default, false), indexArray, valueArray.Build());}}
{{        }}}

as is the R

{{        library(arrow)}}
{{        foo = read_feather("test.feather")}}

If I drop the dictionary column the two Arrow implementations interop without 
difficulty. Same if I write only the indices as a UInt8 column. So the issue 
here is clearly specific to the use of DictionaryColumn. I've also tried other 
index sizes, so it doesn't appear specific to the use of UInt8.
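As background, the layout the C# snippet builds — a compact integer index array referencing a separate value array — can be illustrated in plain Python. This is a conceptual sketch of dictionary encoding, not the Arrow API; the sample values are made up.

```python
def decode_dictionary(indices, values):
    """Map each dictionary index back to its value — the conceptual
    equivalent of decoding an Arrow DictionaryArray."""
    return [values[i] for i in indices]


# A small string table of three distinct values...
values = ["fir", "pine", "oak"]
# ...referenced by a compact UInt8-style index column.
indices = [0, 0, 2, 1, 0]

print(decode_dictionary(indices, values))
```

The index column is what makes the encoding compact: repeated strings are stored once in the value array and referenced by small integers.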

I'm therefore left with two questions:

1) Does DictionaryArray have working use cases in 9.0.0?

2) If what I'm doing's not supposed to work yet, or I'm not getting the data 
structures set up correctly (there's no C# DictionaryArray example [on 
github|https://github.com/apache/arrow/tree/master/csharp/examples]), is there 
an array level workaround?

There's only one string table in this schema and it's typically tiny (five 
values or less) so putting its values part in the schema metadata is a viable 
workaround, albeit an inelegant one.

Not seeing that there's a feather file viewer available but, if there is, I'd 
be happy to take a closer look. Can also link the sources after they've been 
committed and pushed, which should be by the end of the day tomorrow.





[jira] [Created] (ARROW-17392) Disable anonymous namespaces in debug mode

2022-08-11 Thread Sasha Krassovsky (Jira)
Sasha Krassovsky created ARROW-17392:


 Summary: Disable anonymous namespaces in debug mode
 Key: ARROW-17392
 URL: https://issues.apache.org/jira/browse/ARROW-17392
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Sasha Krassovsky
Assignee: Sasha Krassovsky


I've had some pain points when using GDB and the pervasive use of anonymous 
namespaces throughout the code. I sent out an email on the mailing list and no 
one seemed to have any opinions, so I am opening this task. This task will gate 
anonymous namespaces behind an `#ifndef NDEBUG` guard (or perhaps add a 
RELEASE_MODE_ANONYMOUS_NAMESPACE macro of some sort).

 

Mailing list discussion: 
https://lists.apache.org/thread/61rjzb18mvft7lpwglyh4kq2gkbog4ts





[jira] [Updated] (ARROW-17391) [C#] arrow::read_feather() cannot read DictionaryArray written from C#

2022-08-11 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-17391:

Summary: [C#] arrow::read_feather() cannot read DictionaryArray written 
from C#  (was: arrow::read_feather() cannot read DictionaryArray written from 
C#)

> [C#] arrow::read_feather() cannot read DictionaryArray written from C#
> --
>
> Key: ARROW-17391
> URL: https://issues.apache.org/jira/browse/ARROW-17391
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#, R
>Affects Versions: 9.0.1
>Reporter: Todd West
>Priority: Major
> Fix For: 9.0.1
>
>
> This applies to Arrow 9.0.0, both the C# nuget and R package, but for some 
> reason 9.0.0 isn't in the issue dropdowns' list of released versions. It also 
> appears the [implementation status 
> page|https://arrow.apache.org/docs/status.html#ipc-format] may be stale as 
> the C#  source contains 
> [DictionaryArray|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/DictionaryArray.cs]
>  and a look in the debugger confirms the flags flip and the data structures 
> update for 
> [ArrowStreamWriter|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs]
>  having correctly received both the dictionary index and value arrays it's 
> given on the code paths which write a [dictionary 
> batch|https://arrow.apache.org/docs/format/Columnar.html] . However, on the R 
> side, read_feather() fails with
> {{Error: Key error: Dictionary with id 1 not found}}
> So it appears most likely either C# isn't properly emitting the dictionary 
> batch, despite seeming to have all the code to do so, or something's going 
> wrong in the C++ layers under R in the reading side.
> Setup on the C# side is simple
> {{        public static DictionaryArray CreateStringTable(Memory 
> indicies, IList values)}}
> {{        {}}
> {{            StringArray.Builder valueArray = new();}}
> {{            for (int valueIndex = 0; valueIndex < values.Count; 
> ++valueIndex)}}
> {{            {}}
> {{                valueArray.Append(values[valueIndex]);}}
> {{            }}}
> {{            UInt8Array indexArray = 
> new(ArrowArrayExtensions.WrapInArrayData(UInt8Type.Default, indicies, 
> indicies.Length));}}
> {{            return new DictionaryArray(new(UInt8Type.Default, 
> StringType.Default, false), indexArray, valueArray.Build());}}
> {{        }}}
> as is the R
> {{        library(arrow)}}
> {{        foo = read_feather("test.feather")}}
> If I drop the dictionary column the two Arrow implementations interop without 
> difficulty. Same if I write only the indices as a UInt8 column. So the issue 
> here is clearly specific to the use of DictionaryColumn. I've also tried 
> other index sizes, so it doesn't appear specific to the use of UInt8.
> I'm therefore left with two questions:
> 1) Does DictionaryArray have working use cases in 9.0.0?
> 2) If what I'm doing's not supposed to work yet, or I'm not getting the data 
> structures set up correctly (there's no C# DictionaryArray example [on 
> github|https://github.com/apache/arrow/tree/master/csharp/examples]), is 
> there an array level workaround?
> There's only one string table in this schema and it's typically tiny (five 
> values or less) so putting its values part in the schema metadata is a viable 
> workaround, albeit an inelegant one.
> Not seeing that there's a feather file viewer available but, if there is, I'd 
> be happy to take a closer look. Can also link the sources after they've been 
> committed and pushed, which should be by the end of the day tomorrow.





[jira] [Commented] (ARROW-17391) arrow::read_feather() cannot read DictionaryArray written from C#

2022-08-11 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578729#comment-17578729
 ] 

Neal Richardson commented on ARROW-17391:
-

DictionaryArray integration tests are skipped for C#, so this leads me to 
believe that whatever the C# library produces here is not compatible with the 
Arrow specification: 
https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L1651-L1652

> arrow::read_feather() cannot read DictionaryArray written from C#
> -
>
> Key: ARROW-17391
> URL: https://issues.apache.org/jira/browse/ARROW-17391
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#, R
>Affects Versions: 9.0.1
>Reporter: Todd West
>Priority: Major
> Fix For: 9.0.1
>
>
> This applies to Arrow 9.0.0, both the C# nuget and R package, but for some 
> reason 9.0.0 isn't in the issue dropdowns' list of released versions. It also 
> appears the [implementation status 
> page|https://arrow.apache.org/docs/status.html#ipc-format] may be stale as 
> the C#  source contains 
> [DictionaryArray|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/DictionaryArray.cs]
>  and a look in the debugger confirms the flags flip and the data structures 
> update for 
> [ArrowStreamWriter|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs]
>  having correctly received both the dictionary index and value arrays it's 
> given on the code paths which write a [dictionary 
> batch|https://arrow.apache.org/docs/format/Columnar.html] . However, on the R 
> side, read_feather() fails with
> {{Error: Key error: Dictionary with id 1 not found}}
> So it appears most likely either C# isn't properly emitting the dictionary 
> batch, despite seeming to have all the code to do so, or something's going 
> wrong in the C++ layers under R in the reading side.
> Setup on the C# side is simple
> {{        public static DictionaryArray CreateStringTable(Memory 
> indicies, IList values)}}
> {{        {}}
> {{            StringArray.Builder valueArray = new();}}
> {{            for (int valueIndex = 0; valueIndex < values.Count; 
> ++valueIndex)}}
> {{            {}}
> {{                valueArray.Append(values[valueIndex]);}}
> {{            }}}
> {{            UInt8Array indexArray = 
> new(ArrowArrayExtensions.WrapInArrayData(UInt8Type.Default, indicies, 
> indicies.Length));}}
> {{            return new DictionaryArray(new(UInt8Type.Default, 
> StringType.Default, false), indexArray, valueArray.Build());}}
> {{        }}}
> as is the R
> {{        library(arrow)}}
> {{        foo = read_feather("test.feather")}}
> If I drop the dictionary column the two Arrow implementations interop without 
> difficulty. Same if I write only the indices as a UInt8 column. So the issue 
> here is clearly specific to the use of DictionaryColumn. I've also tried 
> other index sizes, so it doesn't appear specific to the use of UInt8.
> I'm therefore left with two questions:
> 1) Does DictionaryArray have working use cases in 9.0.0?
> 2) If what I'm doing's not supposed to work yet, or I'm not getting the data 
> structures set up correctly (there's no C# DictionaryArray example [on 
> github|https://github.com/apache/arrow/tree/master/csharp/examples]), is 
> there an array level workaround?
> There's only one string table in this schema and it's typically tiny (five 
> values or less) so putting its values part in the schema metadata is a viable 
> workaround, albeit an inelegant one.
> Not seeing that there's a feather file viewer available but, if there is, I'd 
> be happy to take a closer look. Can also link the sources after they've been 
> committed and pushed, which should be by the end of the day tomorrow.





[jira] [Created] (ARROW-17393) pyarrow large integer conversion

2022-08-11 Thread Donald Freeman (Jira)
Donald Freeman created ARROW-17393:
--

 Summary: pyarrow large integer conversion
 Key: ARROW-17393
 URL: https://issues.apache.org/jira/browse/ARROW-17393
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Donald Freeman


I have a json document that looks like this. 

{"number": 123451234512345123451234512}

I then run the below code. 

>>> from pyarrow.json import read_json
>>> pyarrow_table = read_json('pyarrow_test.json')
>>> pyarrow_table['number'][0].as_py().as_integer_ratio()
(123451234512345125900779520, 1)

Notice that the float I get back looks like it has been rounded or modified in 
some way.

 

Am I reading this file incorrectly or is there an issue with the conversion of 
this number?
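The rounding can be reproduced without pyarrow: the value exceeds int64, and converting it to a 64-bit float (which the Arrow JSON reader appears to do here) necessarily loses the low digits.

```python
# 123451234512345123451234512 is about 2**87; a float64 has only a
# 53-bit mantissa, so values of this magnitude are rounded to the
# nearest multiple of 2**34.
n = 123451234512345123451234512

f = float(n)                        # what the reader effectively stores
recovered = f.as_integer_ratio()[0]

print(recovered)        # differs from n in the low digits
print(recovered - n)    # the rounding error introduced by float64
```

So the number is not being mangled arbitrarily — it is the expected precision loss of an integer-to-double conversion.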





[jira] [Created] (ARROW-17394) [C++][Parquet] Fix parquet_static dependencies

2022-08-11 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-17394:


 Summary: [C++][Parquet] Fix parquet_static dependencies
 Key: ARROW-17394
 URL: https://issues.apache.org/jira/browse/ARROW-17394
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Parquet
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


See also: 
https://github.com/microsoft/vcpkg/issues/22552#issuecomment-1211341648

{quote}
I tried [#22552 
(comment)|https://github.com/microsoft/vcpkg/issues/22552#issuecomment-1093273619]
 (that is based on Apache Arrow 7.0.0).

The following dependencies are missed from link command line:

* {{vcpkg_installed/x64-linux/debug/lib/libthriftd.a}}
* {{vcpkg_installed/x64-linux/debug/lib/liblz4d.a}}
* {{vcpkg_installed/x64-linux/debug/lib/libbrotli\{enc,dec,common\}-static.a}}

It seems that {{thrift::thrift}}, {{lz4::lz4}} and 
{{unofficial::brotli::brotli\{enc,dec,common\}-static}} are missed in 
{{arrow_static}} dependencies:

{noformat}
$ grep INTERFACE_LINK_LIBRARIES 
vcpkg_installed/x64-linux/share/arrow/ArrowTargets.cmake 
  INTERFACE_LINK_LIBRARIES 
"OpenSSL::Crypto;OpenSSL::SSL;BZip2::BZip2;Snappy::snappy;ZLIB::ZLIB;zstd::libzstd_static;re2::re2;Threads::Threads;rt;\$"
{noformat}

I'm not sure that this is a problem in Apache Arrow's CMake configuration or 
patches in vcpkg.
{quote}





[jira] [Updated] (ARROW-17394) [C++][Parquet] Fix parquet_static dependencies

2022-08-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17394:
---
Labels: pull-request-available  (was: )

> [C++][Parquet] Fix parquet_static dependencies
> --
>
> Key: ARROW-17394
> URL: https://issues.apache.org/jira/browse/ARROW-17394
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See also: 
> https://github.com/microsoft/vcpkg/issues/22552#issuecomment-1211341648
> {quote}
> I tried [#22552 
> (comment)|https://github.com/microsoft/vcpkg/issues/22552#issuecomment-1093273619]
>  (that is based on Apache Arrow 7.0.0).
> The following dependencies are missed from link command line:
> * {{vcpkg_installed/x64-linux/debug/lib/libthriftd.a}}
> * {{vcpkg_installed/x64-linux/debug/lib/liblz4d.a}}
> * {{vcpkg_installed/x64-linux/debug/lib/libbrotli\{enc,dec,common\}-static.a}}
> It seems that {{thrift::thrift}}, {{lz4::lz4}} and 
> {{unofficial::brotli::brotli\{enc,dec,common\}-static}} are missed in 
> {{arrow_static}} dependencies:
> {noformat}
> $ grep INTERFACE_LINK_LIBRARIES 
> vcpkg_installed/x64-linux/share/arrow/ArrowTargets.cmake 
>   INTERFACE_LINK_LIBRARIES 
> "OpenSSL::Crypto;OpenSSL::SSL;BZip2::BZip2;Snappy::snappy;ZLIB::ZLIB;zstd::libzstd_static;re2::re2;Threads::Threads;rt;\$"
> {noformat}
> I'm not sure that this is a problem in Apache Arrow's CMake configuration or 
> patches in vcpkg.
> {quote}





[jira] [Assigned] (ARROW-17383) RHEL8/Centos Repo has a bad repomd.xml file

2022-08-11 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-17383:


Assignee: Kouhei Sutou

> RHEL8/Centos Repo has a bad repomd.xml file
> ---
>
> Key: ARROW-17383
> URL: https://issues.apache.org/jira/browse/ARROW-17383
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Justin Gerry
>Assignee: Kouhei Sutou
>Priority: Blocker
>
> repomd.xml file is corrupted here:
> [https://apache.jfrog.io/ui/native/arrow/centos/8-stream/x86_64/repodata/repomd.xml]
> Can someone fix this so we can install packages from this repo again? 
>  





[jira] [Updated] (ARROW-17383) [Packaging] RHEL8/Centos Repo has a bad repomd.xml file

2022-08-11 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-17383:
-
Summary: [Packaging] RHEL8/Centos Repo has a bad repomd.xml file  (was: 
RHEL8/Centos Repo has a bad repomd.xml file)

> [Packaging] RHEL8/Centos Repo has a bad repomd.xml file
> ---
>
> Key: ARROW-17383
> URL: https://issues.apache.org/jira/browse/ARROW-17383
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Justin Gerry
>Assignee: Kouhei Sutou
>Priority: Blocker
>
> repomd.xml file is corrupted here:
> [https://apache.jfrog.io/ui/native/arrow/centos/8-stream/x86_64/repodata/repomd.xml]
> Can someone fix this so we can install packages from this repo again? 
>  





[jira] [Commented] (ARROW-17383) [Packaging] RHEL8/Centos Repo has a bad repomd.xml file

2022-08-11 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578758#comment-17578758
 ] 

Kouhei Sutou commented on ARROW-17383:
--

Why do you think the repository is broken?

> [Packaging] RHEL8/Centos Repo has a bad repomd.xml file
> ---
>
> Key: ARROW-17383
> URL: https://issues.apache.org/jira/browse/ARROW-17383
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Justin Gerry
>Assignee: Kouhei Sutou
>Priority: Blocker
>
> repomd.xml file is corrupted here:
> [https://apache.jfrog.io/ui/native/arrow/centos/8-stream/x86_64/repodata/repomd.xml]
> Can someone fix this so we can install packages from this repo again? 
>  





[jira] [Updated] (ARROW-17383) [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file

2022-08-11 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-17383:
-
Summary: [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file  (was: 
[Packaging] RHEL8/Centos Repo has a bad repomd.xml file)

> [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file
> --
>
> Key: ARROW-17383
> URL: https://issues.apache.org/jira/browse/ARROW-17383
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Justin Gerry
>Assignee: Kouhei Sutou
>Priority: Blocker
>
> repomd.xml file is corrupted here:
> [https://apache.jfrog.io/ui/native/arrow/centos/8-stream/x86_64/repodata/repomd.xml]
> Can someone fix this so we can install packages from this repo again? 
>  





[jira] [Commented] (ARROW-17383) [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file

2022-08-11 Thread Justin Gerry (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578762#comment-17578762
 ] 

Justin Gerry commented on ARROW-17383:
--

I tried to use it with a RHEL8 machine and yum says it's corrupted. I also tried 
to mirror the repo in a couple of different ways and I get the same error. 

Can you regenerate this and/or test on a RHEL8 machine?

> [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file
> --
>
> Key: ARROW-17383
> URL: https://issues.apache.org/jira/browse/ARROW-17383
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Justin Gerry
>Assignee: Kouhei Sutou
>Priority: Blocker
>
> repomd.xml file is corrupted here:
> [https://apache.jfrog.io/ui/native/arrow/centos/8-stream/x86_64/repodata/repomd.xml]
> Can someone fix this so we can install packages from this repo again? 
>  





[jira] [Commented] (ARROW-17383) [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file

2022-08-11 Thread Justin Gerry (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578764#comment-17578764
 ] 

Justin Gerry commented on ARROW-17383:
--

There were no issues prior to Aug 3rd (I think), which coincides with the 9.0 
release; that would make sense if the repo was updated. 

> [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file
> --
>
> Key: ARROW-17383
> URL: https://issues.apache.org/jira/browse/ARROW-17383
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Justin Gerry
>Assignee: Kouhei Sutou
>Priority: Blocker
>
> repomd.xml file is corrupted here:
> [https://apache.jfrog.io/ui/native/arrow/centos/8-stream/x86_64/repodata/repomd.xml]
> Can someone fix this so we can install packages from this repo again? 
>  





[jira] [Commented] (ARROW-17383) [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file

2022-08-11 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578765#comment-17578765
 ] 

Kouhei Sutou commented on ARROW-17383:
--

Could you show user command lines and their outputs?

I don't have RHEL 8 for now. I tried it with CentOS Stream 8 with the 
instruction in https://arrow.apache.org/install/ but I could install 
{{arrow-devel}}.

BTW, 
https://apache.jfrog.io/ui/native/arrow/centos/8-stream/x86_64/repodata/repomd.xml
 isn't for RHEL 8. 
https://apache.jfrog.io/ui/native/arrow/almalinux/8/x86_64/repodata/repomd.xml 
is for RHEL 8. 

> [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file
> --
>
> Key: ARROW-17383
> URL: https://issues.apache.org/jira/browse/ARROW-17383
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Justin Gerry
>Assignee: Kouhei Sutou
>Priority: Blocker
>
> repomd.xml file is corrupted here:
> [https://apache.jfrog.io/ui/native/arrow/centos/8-stream/x86_64/repodata/repomd.xml]
> Can someone fix this so we can install packages from this repo again? 
>  





[jira] [Comment Edited] (ARROW-17383) [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file

2022-08-11 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578765#comment-17578765
 ] 

Kouhei Sutou edited comment on ARROW-17383 at 8/12/22 3:12 AM:
---

Could you show the command lines you tried and their outputs?

I don't have RHEL 8 for now. I tried it with CentOS Stream 8 with the 
instruction in https://arrow.apache.org/install/ but I could install 
{{arrow-devel}}.

BTW, 
https://apache.jfrog.io/ui/native/arrow/centos/8-stream/x86_64/repodata/repomd.xml
 isn't for RHEL 8. 
https://apache.jfrog.io/ui/native/arrow/almalinux/8/x86_64/repodata/repomd.xml 
is for RHEL 8. 


was (Author: kou):
Could you show user command lines and their outputs?

I don't have RHEL 8 for now. I tried it with CentOS Stream 8 with the 
instruction in https://arrow.apache.org/install/ but I could install 
{{arrow-devel}}.

BTW, 
https://apache.jfrog.io/ui/native/arrow/centos/8-stream/x86_64/repodata/repomd.xml
 isn't for RHEL 8. 
https://apache.jfrog.io/ui/native/arrow/almalinux/8/x86_64/repodata/repomd.xml 
is for RHEL 8. 

> [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file
> --
>
> Key: ARROW-17383
> URL: https://issues.apache.org/jira/browse/ARROW-17383
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Justin Gerry
>Assignee: Kouhei Sutou
>Priority: Blocker
>
> repomd.xml file is corrupted here:
> [https://apache.jfrog.io/ui/native/arrow/centos/8-stream/x86_64/repodata/repomd.xml]
> Can someone fix this so we can install packages from this repo again? 
>  





[jira] [Resolved] (ARROW-17358) [CI][C++] Add a job for Alpine Linux

2022-08-11 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-17358.
--
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13825
[https://github.com/apache/arrow/pull/13825]

> [CI][C++] Add a job for Alpine Linux
> 
>
> Key: ARROW-17358
> URL: https://issues.apache.org/jira/browse/ARROW-17358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-17383) [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file

2022-08-11 Thread Justin Gerry (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578777#comment-17578777
 ] 

Justin Gerry commented on ARROW-17383:
--

That's better! Note that the UI native link won't work for the OS itself.

Since you are using Artifactory, the correct yum URL is: 
[https://apache.jfrog.io/artifactory/arrow/almalinux/8]

Can you add notes regarding RHEL8 compatibility to 
[https://arrow.apache.org/release/9.0.0.html] or 
[https://arrow.apache.org/release/8.0.0.html] to use Almalinux instead? 

Thanks for your help.

> [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file
> --
>
> Key: ARROW-17383
> URL: https://issues.apache.org/jira/browse/ARROW-17383
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Justin Gerry
>Assignee: Kouhei Sutou
>Priority: Blocker
>
> repomd.xml file is corrupted here:
> [https://apache.jfrog.io/ui/native/arrow/centos/8-stream/x86_64/repodata/repomd.xml]
> Can someone fix this so we can install packages from this repo again? 
>  





[jira] [Resolved] (ARROW-17383) [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file

2022-08-11 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-17383.
--
Resolution: Not A Problem

I don't think we need to add notes to the release pages, because our install 
page https://arrow.apache.org/install/ already says "AlmaLinux 8 and Red Hat 
Enterprise Linux 8".

> [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file
> --
>
> Key: ARROW-17383
> URL: https://issues.apache.org/jira/browse/ARROW-17383
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Justin Gerry
>Assignee: Kouhei Sutou
>Priority: Blocker
>
> repomd.xml file is corrupted here:
> [https://apache.jfrog.io/ui/native/arrow/centos/8-stream/x86_64/repodata/repomd.xml]
> Can someone fix this so we can install packages from this repo again? 
>  





[jira] [Resolved] (ARROW-17357) [CI][Conan] Enable JSON

2022-08-11 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-17357.
--
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13823
[https://github.com/apache/arrow/pull/13823]

> [CI][Conan] Enable JSON
> ---
>
> Key: ARROW-17357
> URL: https://issues.apache.org/jira/browse/ARROW-17357
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-17367) [C++] Fix the LZ4's CMake target name

2022-08-11 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-17367.
--
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13831
[https://github.com/apache/arrow/pull/13831]

> [C++] Fix the LZ4's CMake target name
> -
>
> Key: ARROW-17367
> URL: https://issues.apache.org/jira/browse/ARROW-17367
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I changed LZ4's CMake target name from {{LZ4::lz4}} to {{lz4::lz4}} in 
> ARROW-16614 because I thought that the official LZ4 CMake package used 
> {{lz4::lz4}}, not {{LZ4::lz4}}.
> But I was wrong. The official LZ4 CMake package uses {{LZ4::lz4}}, not 
> {{lz4::lz4}}: 
> https://github.com/lz4/lz4/commit/f9378137ed306ea71d1d9cd60e1acf7b8e57f0df



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17370) [C++] Add limit to SplitString()

2022-08-11 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-17370.
--
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13833
[https://github.com/apache/arrow/pull/13833]

> [C++] Add limit to SplitString()
> 
>
> Key: ARROW-17370
> URL: https://issues.apache.org/jira/browse/ARROW-17370
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-17395) [CI][Conan] can't find grpc-proto/cci.20220627 package

2022-08-11 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-17395:


 Summary: [CI][Conan] can't find grpc-proto/cci.20220627 package
 Key: ARROW-17395
 URL: https://issues.apache.org/jira/browse/ARROW-17395
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


https://github.com/ursacomputing/crossbow/runs/7783226405?check_suite_focus=true

{noformat}
WARN: thrift/0.16.0: requirement openssl/1.1.1o overridden by arrow/10.0.0 to 
openssl/1.1.1q 
WARN: grpc/1.47.0: requirement openssl/1.1.1o overridden by arrow/10.0.0 to 
openssl/1.1.1q 
WARN: grpc-proto/cci.20220627: requirement googleapis/cci.20220711 overridden 
by grpc/1.47.0 to googleapis/cci.20220531 
ERROR: Missing binary: 
grpc-proto/cci.20220627:a009d554471614a67005f24fdcb37541daece7cb
grpc-proto/cci.20220627: WARN: Can't find a 'grpc-proto/cci.20220627' package 
for the specified settings, options and dependencies:
- Settings: arch=x86_64, build_type=Release, compiler=gcc, 
compiler.libcxx=libstdc++, compiler.version=10, os=Linux
- Options: fPIC=True, shared=False, googleapis:fPIC=True, 
googleapis:shared=False, protobuf:debug_suffix=True, protobuf:fPIC=True, 
protobuf:lite=False, protobuf:shared=False, protobuf:with_rtti=True, 
protobuf:with_zlib=True, zlib:fPIC=True, zlib:shared=False
- Dependencies: protobuf/3.21.1, googleapis/cci.20220531
- Requirements: googleapis/cci.20220531, 
protobuf/3.21.1:37dd8aae630726607d9d4108fefd2f59c8f7e9db
- Package ID: a009d554471614a67005f24fdcb37541daece7cb
{noformat}

grpc-proto/cci has been updated but grpc hasn't been updated yet, so we can't 
use grpc-proto/cci's pre-built binary.
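Until the grpc recipe catches up, a generic workaround sketch (my suggestion, not something decided in this ticket) is to tell Conan to build missing binaries from source instead of requiring a pre-built package:

```shell
# Generic Conan 1.x workaround sketch (assumption, not from this ticket):
# build from source any package whose binary is missing for the resolved
# settings/options, instead of failing on "Missing binary".
conan install . --build=missing

# Or rebuild only the affected recipe:
conan install . --build=grpc-proto
```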





[jira] [Updated] (ARROW-17395) [CI][Conan] can't find grpc-proto/cci.20220627 package

2022-08-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17395:
---
Labels: pull-request-available  (was: )

> [CI][Conan] can't find grpc-proto/cci.20220627 package
> --
>
> Key: ARROW-17395
> URL: https://issues.apache.org/jira/browse/ARROW-17395
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://github.com/ursacomputing/crossbow/runs/7783226405?check_suite_focus=true
> {noformat}
> WARN: thrift/0.16.0: requirement openssl/1.1.1o overridden by arrow/10.0.0 to 
> openssl/1.1.1q 
> WARN: grpc/1.47.0: requirement openssl/1.1.1o overridden by arrow/10.0.0 to 
> openssl/1.1.1q 
> WARN: grpc-proto/cci.20220627: requirement googleapis/cci.20220711 overridden 
> by grpc/1.47.0 to googleapis/cci.20220531 
> ERROR: Missing binary: 
> grpc-proto/cci.20220627:a009d554471614a67005f24fdcb37541daece7cb
> grpc-proto/cci.20220627: WARN: Can't find a 'grpc-proto/cci.20220627' package 
> for the specified settings, options and dependencies:
> - Settings: arch=x86_64, build_type=Release, compiler=gcc, 
> compiler.libcxx=libstdc++, compiler.version=10, os=Linux
> - Options: fPIC=True, shared=False, googleapis:fPIC=True, 
> googleapis:shared=False, protobuf:debug_suffix=True, protobuf:fPIC=True, 
> protobuf:lite=False, protobuf:shared=False, protobuf:with_rtti=True, 
> protobuf:with_zlib=True, zlib:fPIC=True, zlib:shared=False
> - Dependencies: protobuf/3.21.1, googleapis/cci.20220531
> - Requirements: googleapis/cci.20220531, 
> protobuf/3.21.1:37dd8aae630726607d9d4108fefd2f59c8f7e9db
> - Package ID: a009d554471614a67005f24fdcb37541daece7cb
> {noformat}
> grpc-proto/cci is updated but grpc isn't updated yet. So we can't use 
> grpc-proto/cci's pre-built binary.





[jira] [Commented] (ARROW-17383) [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file

2022-08-11 Thread Justin Gerry (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578805#comment-17578805
 ] 

Justin Gerry commented on ARROW-17383:
--

It's easy to end up missing that note if you start at the pages I referenced. 
I see it now, though. You can close this ticket. Thanks for your help.

> [Packaging] CentOS Stream 8 Repo has a bad repomd.xml file
> --
>
> Key: ARROW-17383
> URL: https://issues.apache.org/jira/browse/ARROW-17383
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Justin Gerry
>Assignee: Kouhei Sutou
>Priority: Blocker
>
> repomd.xml file is corrupted here:
> [https://apache.jfrog.io/ui/native/arrow/centos/8-stream/x86_64/repodata/repomd.xml]
> Can someone fix this so we can install packages from this repo again? 
>  


