[jira] [Created] (ARROW-18378) MIGRATION: Disable issue reporting in ASF Jira

2022-11-21 Thread Todd Farmer (Jira)
Todd Farmer created ARROW-18378:
---

 Summary: MIGRATION: Disable issue reporting in ASF Jira
 Key: ARROW-18378
 URL: https://issues.apache.org/jira/browse/ARROW-18378
 Project: Apache Arrow
  Issue Type: Task
Reporter: Todd Farmer


ARROW-18364 enabled issue reporting for Apache Arrow in GitHub issues. Even 
though existing Jira issues have not yet been migrated and are still being 
worked on in Jira, we should assess disabling creation of new issues in 
ASF Jira and pointing users to GitHub issues instead. This may benefit the 
project by removing the need to monitor inflow in two separate systems.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18377) MIGRATION: Automate component labels from issue form content

2022-11-21 Thread Todd Farmer (Jira)
Todd Farmer created ARROW-18377:
---

 Summary: MIGRATION: Automate component labels from issue form 
content
 Key: ARROW-18377
 URL: https://issues.apache.org/jira/browse/ARROW-18377
 Project: Apache Arrow
  Issue Type: Task
Reporter: Todd Farmer


ARROW-18364 added the ability to report issues in GitHub, and includes GitHub 
issue templates with a drop-down component(s) selector. These form elements 
only drive the resulting issue markdown, and cannot dynamically drive issue 
labels. Applying labels from the form content requires GitHub Actions, which 
have a few limitations of their own. First, the issue form does not produce 
any structured data; it only produces the issue description markdown, so a 
parser is required. Second, ASF restricts GitHub Actions to a selection of 
approved actions. So although community actions exist to generate structured 
data from issue forms, the Apache Arrow project will likely need to write its 
own parser and label application action.
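
The parsing step described above could look roughly like the following. This is 
a minimal sketch, not the project's implementation. It assumes that the issue 
form renders each field as a "### <field label>" heading followed by the answer, 
that multiple drop-down selections are comma-separated, and that an unanswered 
field renders as "_No response_". The heading text "Component(s)" and the 
function names are hypothetical; the "Component: " prefix follows the label 
names listed in ARROW-18376.

```python
import re

# Assumed heading text of the component drop-down in the issue form (hypothetical).
COMPONENT_HEADING = "Component(s)"
# Prefix matching the "Component: <name>" labels listed in ARROW-18376.
LABEL_PREFIX = "Component: "


def parse_issue_form(body: str) -> dict:
    """Split an issue body into {heading: answer} using '### ' headings."""
    sections = {}
    heading = None
    lines = []
    for line in body.splitlines():
        match = re.match(r"^###\s+(.*\S)\s*$", line)
        if match:
            # A new heading closes the previous section, if any.
            if heading is not None:
                sections[heading] = "\n".join(lines).strip()
            heading = match.group(1)
            lines = []
        elif heading is not None:
            lines.append(line)
    if heading is not None:
        sections[heading] = "\n".join(lines).strip()
    return sections


def component_labels(body: str) -> list:
    """Return the label names to apply for the components selected in the form."""
    answer = parse_issue_form(body).get(COMPONENT_HEADING, "")
    if not answer or answer == "_No response_":
        return []
    return [LABEL_PREFIX + name.strip() for name in answer.split(",") if name.strip()]
```

A caveat of this approach: any "### " line a reporter types into a free-text 
field is treated as a new section, so a real action would likely need to match 
against the known field labels from the template.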





[jira] [Created] (ARROW-18376) MIGRATION: Add component labels to GitHub

2022-11-21 Thread Todd Farmer (Jira)
Todd Farmer created ARROW-18376:
---

 Summary: MIGRATION: Add component labels to GitHub
 Key: ARROW-18376
 URL: https://issues.apache.org/jira/browse/ARROW-18376
 Project: Apache Arrow
  Issue Type: Task
Reporter: Todd Farmer


Similar to ARROW-18375, component labels have been established based on 
existing component values defined in ASF Jira. The following labels are needed:

* Component: Archery
* Component: Benchmarking
* Component: C
* Component: C#
* Component: C++
* Component: C++ - Gandiva
* Component: C++ - Plasma
* Component: Continuous Integration
* Component: Dart
* Component: Developer Tools
* Component: Documentation
* Component: FlightRPC
* Component: Format
* Component: GLib
* Component: Go
* Component: GPU
* Component: Integration
* Component: Java
* Component: JavaScript
* Component: MATLAB
* Component: Packaging
* Component: Parquet
* Component: Python
* Component: R
* Component: Ruby
* Component: Swift
* Component: Website
* Component: Other





[jira] [Created] (ARROW-18375) MIGRATION: Enable GitHub issue type labels

2022-11-21 Thread Todd Farmer (Jira)
Todd Farmer created ARROW-18375:
---

 Summary: MIGRATION: Enable GitHub issue type labels
 Key: ARROW-18375
 URL: https://issues.apache.org/jira/browse/ARROW-18375
 Project: Apache Arrow
  Issue Type: Task
Reporter: Todd Farmer


As part of enabling GitHub issue reporting, the following labels have been 
defined and need to be added to the repository label options. Without these 
labels added, [new issues|https://github.com/apache/arrow/issues/14692] do not 
get the issue template-defined issue type labels set properly.

 

Labels:
 * Type: bug
 * Type: enhancement
 * Type: usage

 





[jira] [Created] (ARROW-18374) [Go][CI][Benchmarks] Fix Go Bench Script after conbench change

2022-11-21 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-18374:
-

 Summary: [Go][CI][Benchmarks] Fix Go Bench Script after conbench 
change
 Key: ARROW-18374
 URL: https://issues.apache.org/jira/browse/ARROW-18374
 Project: Apache Arrow
  Issue Type: Bug
  Components: Benchmarking, Continuous Integration, Go
Reporter: Matthew Topol
Assignee: Matthew Topol


Change [https://github.com/conbench/conbench/pull/417/files#] now requires 
passing an explicit {{github=None}} argument to {{BenchmarkResult}} so that it 
gets the GitHub info from the locally cloned repo.





[jira] [Created] (ARROW-18373) MIGRATION: Enable multiple component selection in issue templates

2022-11-21 Thread Todd Farmer (Jira)
Todd Farmer created ARROW-18373:
---

 Summary: MIGRATION: Enable multiple component selection in issue 
templates
 Key: ARROW-18373
 URL: https://issues.apache.org/jira/browse/ARROW-18373
 Project: Apache Arrow
  Issue Type: Task
Reporter: Todd Farmer


Per comments in [this merged PR|https://github.com/apache/arrow/pull/14675], we 
would like to enable selection of multiple components when reporting issues via 
GitHub issues.

Additionally, we may want to add the needed Apache license to the issue 
templates and remove the exclusion rules from rat_exclude_files.txt.





[jira] [Created] (ARROW-18372) [R] "Error in `collect()`: ! Invalid: negative malloc size" after large computation returning one cell

2022-11-21 Thread Lucas Mation (Jira)
Lucas Mation created ARROW-18372:


 Summary: [R] "Error in `collect()`: ! Invalid: negative malloc 
size" after large computation returning one cell
 Key: ARROW-18372
 URL: https://issues.apache.org/jira/browse/ARROW-18372
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 10.0.0
Reporter: Lucas Mation


I have a large Parquet dataset (900 million rows, 40 columns), subdivided 
into folders for each year. I was trying to calculate how many unique 
combinations of id1+id2+id3+id4 there are in the dataset.

 

Notice that the "collected" dataset is supposed to be only one row and one cell, 
containing the count (I've confirmed this by subsetting the dataset with "%>% 
head(10^6)" before computing the count, and it works). That is why the error 
below is so weird.

```

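# Packages used below (all listed as attached in sessionInfo())
library(arrow)
library(dplyr)
library(tictoc)

fa <- 'myparteq folder' # huge
va <- open_dataset(fa)

tic()
# Count the distinct id1+id2+id3+id4 combinations; head(10^6) is the subsetting step described above
d <- va %>% head(10^6) %>% count(id1,id2,id3,id4) %>% count %>% collect
toc()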

 

Error in `collect()`:
! Invalid: negative malloc size
Run `rlang::last_error()` to see where the error occurred.

 

> rlang::last_error()

Error in `collect()`:
! Invalid: negative malloc size
---
Backtrace:
 1. ... %>% collect
 3. arrow:::collect.arrow_dplyr_query(.)
Run `rlang::last_trace()` to see the full context.

 

> rlang::last_trace()

Error in `collect()`:
! Invalid: negative malloc size
---
Backtrace:
    x
 1. +-... %>% collect
 2. +-dplyr::collect(.)
 3. \-arrow:::collect.arrow_dplyr_query(.)
 4.   \-base::tryCatch(...)
 5.     \-base (local) tryCatchList(expr, classes, parentenv, handlers)
 6.       \-base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 7.         \-value[[3L]](cond)
 8.           \-arrow:::augment_io_error_msg(e, call, schema = x$.data$schema)
 9.             \-rlang::abort(msg, call = call)

 

```

I am running this on a Windows server with 512 GB of RAM.

 sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 R2 x64 (build 9600)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252    
LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] arrow_10.0.0      data.table_1.14.4 forcats_0.5.2     dplyr_1.0.10      
purrr_0.3.5  readr_2.1.3       tidyr_1.2.1       tibble_3.1.8     
 [9] ggplot2_3.3.6     tidyverse_1.3.2   gt_0.7.0          xtable_1.8-4      
ggthemes_4.2.4    collapse_1.8.6    pryr_0.1.5        janitor_2.1.0    
[17] tictoc_1.1        lubridate_1.8.0   stringr_1.4.1     readxl_1.4.1     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9          assertthat_0.2.1    digest_0.6.30       utf8_1.2.2     
     R6_2.5.1            cellranger_1.1.0    backports_1.4.1    
 [8] reprex_2.0.2        httr_1.4.4          pillar_1.8.1        rlang_1.0.6    
     googlesheets4_1.0.1 rstudioapi_0.14     googledrive_2.0.0  
[15] bit_4.0.4           munsell_0.5.0       broom_1.0.1         compiler_4.2.1 
     modelr_0.1.9        pkgconfig_2.0.3     htmltools_0.5.3    
[22] tidyselect_1.2.0    codetools_0.2-18    fansi_1.0.3         crayon_1.5.2   
     tzdb_0.3.0          dbplyr_2.2.1        withr_2.5.0        
[29] grid_4.2.1          jsonlite_1.8.3      gtable_0.3.1        
lifecycle_1.0.3     DBI_1.1.3           magrittr_2.0.3      scales_1.2.1       
[36] cli_3.4.1           stringi_1.7.8       fs_1.5.2            
snakecase_0.11.0    xml2_1.3.3          ellipsis_0.3.2      generics_0.1.3     
[43] vctrs_0.5.0         tools_4.2.1         bit64_4.0.5         glue_1.6.2     
     hms_1.1.2           parallel_4.2.1      fastmap_1.1.0      
[50] colorspace_2.0-3    gargle_1.2.1        rvest_1.0.3         haven_2.5.1    

 

 arrow_info()
Arrow package version: 10.0.0

Capabilities:
               
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc  FALSE
mimalloc   TRUE

Arrow options():
                       
arrow.use_threads FALSE

Memory:
                  
Allocator mimalloc
Current   74.82 Gb
Max       97.75 Gb

Runtime:
                        
SIMD Level          avx2
Detected SIMD Level avx2

Build:
                                                             
C++ Library Version                                    10.0.0
C++ Compiler                                              GNU
C++ Compiler Version                                   10.3.0
Git ID               aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0

 





[jira] [Created] (ARROW-18371) [C++] Expose *FromJSON helpers

2022-11-21 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-18371:
--

 Summary: [C++] Expose *FromJSON helpers
 Key: ARROW-18371
 URL: https://issues.apache.org/jira/browse/ARROW-18371
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Rok Mihevc


The ArrayFromJSON, ExecBatchFromJSON, and RecordBatchFromJSON helper functions 
would be useful when testing in projects that use Arrow. BatchesWithSchema and 
MakeBasicBatches could be considered as well.


