[jira] [Created] (ARROW-16514) [Website] Update install page for 8.0.0

2022-05-09 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-16514:


 Summary: [Website] Update install page for 8.0.0
 Key: ARROW-16514
 URL: https://issues.apache.org/jira/browse/ARROW-16514
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16513) [C++] Add a compute function to hash inputs

2022-05-09 Thread Weston Pace (Jira)
Weston Pace created ARROW-16513:
---

 Summary: [C++] Add a compute function to hash inputs
 Key: ARROW-16513
 URL: https://issues.apache.org/jira/browse/ARROW-16513
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace


We have a lot of internal logic for hashing inputs and it might be nice to 
expose some of this to users (e.g. 
https://stackoverflow.com/questions/72177022/how-to-get-hash-of-string-column-in-polars-or-pyarrow)

The `HashBatch` method in `key_hash.h` (not quite merged but close) is likely 
to be the most performant.  However, it trades away some uniqueness of hashes 
for performance, so we should make sure to document this.
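The proposed compute function does not exist yet; as a rough, hypothetical sketch of the desired behavior (plain Python with hashlib standing in for Arrow's internal hashers, and `hash_column` an illustrative name), hashing a string column might look like:

```python
import hashlib

def hash_column(values):
    """Hash each string to a 64-bit integer (deterministic, unlike Python's hash())."""
    out = []
    for v in values:
        if v is None:
            out.append(None)  # propagate nulls, as Arrow kernels typically do
        else:
            digest = hashlib.sha256(v.encode("utf-8")).digest()
            # truncating the digest trades uniqueness for speed/size,
            # the same tradeoff HashBatch would need to document
            out.append(int.from_bytes(digest[:8], "little"))
    return out

hashes = hash_column(["a", "b", "a", None])
```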





[jira] [Created] (ARROW-16512) [C++] Support nested custom output field names in Substrait

2022-05-09 Thread Weston Pace (Jira)
Weston Pace created ARROW-16512:
---

 Summary: [C++] Support nested custom output field names in 
Substrait
 Key: ARROW-16512
 URL: https://issues.apache.org/jira/browse/ARROW-16512
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace


ARROW-15901 added initial support for {{RelRoot::names}}, which assigns names 
to the output.

We still need to add support for struct columns.  {{RelRoot::names}} should be 
a DFS ordered list of names that includes the names of any nested fields.
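To illustrate the intended ordering (a hypothetical sketch, not the actual Substrait consumer code): given a struct column, the DFS name list interleaves each field's name with the names of its children:

```python
def dfs_names(fields):
    """Flatten a schema into a depth-first list of field names.

    fields: list of (name, children) pairs, where children has the same shape.
    """
    names = []
    for name, children in fields:
        names.append(name)            # the field itself comes first...
        names.extend(dfs_names(children))  # ...then all of its nested fields
    return names

# schema with a plain column "a", a struct "s" (containing "x" and a
# nested struct "y" with child "z"), and a trailing plain column "b"
schema = [("a", []), ("s", [("x", []), ("y", [("z", [])])]), ("b", [])]
names = dfs_names(schema)
```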





[jira] [Created] (ARROW-16511) [R] Preserve schema metadata in write_dataset()

2022-05-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-16511:
---

 Summary: [R] Preserve schema metadata in write_dataset()
 Key: ARROW-16511
 URL: https://issues.apache.org/jira/browse/ARROW-16511
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 9.0.0, 8.0.1


When we moved to using ExecPlans instead of Scanner, the metadata from the 
input table was dropped. We preserved the R metadata but not anything else. It 
turned out that {{sfarrow}} was relying on extra metadata, and this caused 
reverse dependency failures in the 8.0.0 release.





[jira] [Created] (ARROW-16510) [R] Add bindings for GCS filesystem

2022-05-09 Thread Will Jones (Jira)
Will Jones created ARROW-16510:
--

 Summary: [R] Add bindings for GCS filesystem
 Key: ARROW-16510
 URL: https://issues.apache.org/jira/browse/ARROW-16510
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 8.0.0
Reporter: Will Jones
 Fix For: 9.0.0








[jira] [Created] (ARROW-16509) [R][Docs] Update dataset vignette

2022-05-09 Thread Will Jones (Jira)
Will Jones created ARROW-16509:
--

 Summary: [R][Docs] Update dataset vignette
 Key: ARROW-16509
 URL: https://issues.apache.org/jira/browse/ARROW-16509
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, R
Affects Versions: 8.0.0
Reporter: Will Jones
 Fix For: 9.0.0


Since the dataset vignette was written, we've added join, aggregation, and 
distinct support (and soon union/union_all support). The dataset vignette 
currently says we don't support those operations.





[jira] [Created] (ARROW-16508) [Archery][DevTools] Allow specific success or failure message to be sent on chat report

2022-05-09 Thread Raúl Cumplido (Jira)
Raúl Cumplido created ARROW-16508:
-

 Summary: [Archery][DevTools] Allow specific success or failure 
message to be sent on chat report
 Key: ARROW-16508
 URL: https://issues.apache.org/jira/browse/ARROW-16508
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Raúl Cumplido
Assignee: Raúl Cumplido
 Fix For: 9.0.0


Feature request: allow the chat report message to be extended based on the 
success or failure of jobs.





[jira] [Created] (ARROW-16507) [CI][C++] Use system gtest with numba/conda

2022-05-09 Thread Jacob Wujciak-Jens (Jira)
Jacob Wujciak-Jens created ARROW-16507:
--

 Summary: [CI][C++] Use system gtest with numba/conda
 Key: ARROW-16507
 URL: https://issues.apache.org/jira/browse/ARROW-16507
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration
Reporter: Jacob Wujciak-Jens
Assignee: Jacob Wujciak-Jens
 Fix For: 9.0.0


With the change in ARROW-1490, removing gtest is no longer needed and now 
breaks the build.





[jira] [Created] (ARROW-16506) Pyarrow 8.0.0 write_dataset writes data in different order with {{use_threads=True}}

2022-05-09 Thread Daniel Friar (Jira)
Daniel Friar created ARROW-16506:


 Summary: Pyarrow 8.0.0 write_dataset writes data in different 
order with {{use_threads=True}}
 Key: ARROW-16506
 URL: https://issues.apache.org/jira/browse/ARROW-16506
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Daniel Friar


In the latest (8.0.0) release, the following code snippet seems to write out 
data in a different order for each of the partitions with {{use_threads=True}} 
vs. with {{use_threads=False}}.

Testing the same snippet with pyarrow 7.0.0 gives the same order regardless of 
whether {{use_threads}} is set to True when the data is written.

 
{code:python}
import itertools

import numpy as np
import pandas as pd
import pyarrow.dataset as ds
import pyarrow as pa

n_rows, n_cols = 100_000, 20

def create_dataframe(color, year):
    arr = np.random.randn(n_rows, n_cols)
    df = pd.DataFrame(data=arr, columns=[f"column_{i}" for i in range(n_cols)])
    df["color"] = color
    df["year"] = year
    df["id"] = np.arange(len(df))
    return df


partitions = ["red", "green", "blue"]
years = [2011, 2012, 2013]
dataframes = [create_dataframe(p, y)
              for p, y in itertools.product(partitions, years)]
df = pd.concat(dataframes)

table = pa.Table.from_pandas(df=df)

ds.write_dataset(
    table,
    "./test",
    format="parquet",
    max_rows_per_group=1_000_000,
    min_rows_per_group=1_000_000,
    existing_data_behavior="overwrite_or_ignore",
    partitioning=ds.partitioning(pa.schema([
        ("color", pa.string()),
        ("year", pa.int64())
    ]), flavor="hive"),
    use_threads=True,
)

df_read = pd.read_parquet("./test/color=blue/year=2012")
df_read.head()[["id"]]
{code}
 

Tested on Ubuntu 20.04 with Python 3.8 and arrow versions 8.0.0 and 7.0.0.
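Since the snippet above writes an explicit {{id}} column, one workaround (sketched here on a plain list of dicts rather than a real DataFrame) is to restore a deterministic order by sorting on that column after reading:

```python
# rows as they might come back from a threaded writer: arbitrary order
records = [{"id": 2, "value": "c"},
           {"id": 0, "value": "a"},
           {"id": 1, "value": "b"}]

# sorting on the explicit row-id column restores a deterministic order
records.sort(key=lambda r: r["id"])
ids = [r["id"] for r in records]
```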





[jira] [Created] (ARROW-16505) [Python][Parquet] Enable usage of external key material and rotation for encryption keys in PyArrow

2022-05-09 Thread Maya Anderson (Jira)
Maya Anderson created ARROW-16505:
-

 Summary: [Python][Parquet] Enable usage of external key material 
and rotation for encryption keys in PyArrow
 Key: ARROW-16505
 URL: https://issues.apache.org/jira/browse/ARROW-16505
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Maya Anderson


Python API wrapper for ARROW-9960.





[jira] [Created] (ARROW-16504) [Go][CSV] Add arrow.TimestampType support to the reader

2022-05-09 Thread Mark Wolfe (Jira)
Mark Wolfe created ARROW-16504:
--

 Summary: [Go][CSV] Add arrow.TimestampType support to the reader
 Key: ARROW-16504
 URL: https://issues.apache.org/jira/browse/ARROW-16504
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Affects Versions: 8.0.0
Reporter: Mark Wolfe


There is already a helper to convert strings to arrow.Timestamp, so 
incorporate it into the CSV reader.
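For illustration only (plain Python rather than the Go reader, and assuming a hypothetical "%Y-%m-%d %H:%M:%S" layout), the per-cell conversion the reader would perform is essentially:

```python
import csv
import io
from datetime import datetime

raw = "ts,value\n2022-05-09 10:00:00,1\n2022-05-09 11:30:00,2\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# parse each string cell into a timestamp, as the CSV reader would
timestamps = [datetime.strptime(r["ts"], "%Y-%m-%d %H:%M:%S") for r in rows]
```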





[jira] [Created] (ARROW-16503) [C++] Can't concatenate extension arrays

2022-05-09 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-16503:


 Summary: [C++] Can't concatenate extension arrays
 Key: ARROW-16503
 URL: https://issues.apache.org/jira/browse/ARROW-16503
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Dewey Dunnington


It looks like Arrays with an extension type can't be concatenated. From the R 
bindings:

{code:R}
library(arrow, warn.conflicts = FALSE)

arr <- vctrs_extension_array(1:10)
concat_arrays(arr, arr)
#> Error: NotImplemented: concatenation of integer(0)
#> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195
  VisitTypeInline(*out_->type, this)
#> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590
  ConcatenateImpl(data, pool).Concatenate(_data)
{code}

This shows up more practically when using the query engine:

{code:R}
library(arrow, warn.conflicts = FALSE)

table <- arrow_table(
  group = rep(c("a", "b"), 5),
  col1 = 1:10,
  col2 = vctrs_extension_array(1:10)
)

tf <- tempfile()
table |> dplyr::group_by(group) |> write_dataset(tf)
open_dataset(tf) |>
  dplyr::arrange(col1) |> 
  dplyr::collect()
#> Error in `dplyr::collect()`:
#> ! NotImplemented: concatenation of extension
#> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195
  VisitTypeInline(*out_->type, this)
#> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590
  ConcatenateImpl(data, pool).Concatenate(_data)
#> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2025
  Concatenate(values.chunks(), ctx->memory_pool())
#> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2084
  TakeCA(*table.column(j), indices, options, ctx)
#> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/sink_node.cc:527
  impl_->DoFinish()
#> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:467
  iterator_.Next()
#> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337 
 ReadNext()
#> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351 
 ToRecordBatches()
{code}






[jira] [Created] (ARROW-16502) StructBuilder UnmarshalJSON does not handle missing optional fields

2022-05-09 Thread Przemysław Kowolik (Jira)
Przemysław Kowolik created ARROW-16502:
--

 Summary: StructBuilder UnmarshalJSON does not handle missing 
optional fields
 Key: ARROW-16502
 URL: https://issues.apache.org/jira/browse/ARROW-16502
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Affects Versions: 8.0.0
Reporter: Przemysław Kowolik


When array.StructBuilder.UnmarshalJSON is called with a JSON object that is 
missing optional fields, it fails to decode the object properly and panics. 
Dropping empty/null fields from JSON is common behavior, so this case should 
be handled.
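The expected behavior, sketched in Python for brevity (the actual fix would be in the Go StructBuilder, and `decode_struct` is an illustrative name), is to treat missing optional fields as null rather than failing:

```python
import json

def decode_struct(text, field_names):
    """Decode a JSON object, mapping absent optional fields to None
    instead of raising/panicking."""
    obj = json.loads(text)
    # dict.get returns None for absent keys, standing in for a null field
    return {name: obj.get(name) for name in field_names}

# "b" is absent from the input; it should decode as null, not panic
row = decode_struct('{"a": 1}', ["a", "b"])
```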





[jira] [Created] (ARROW-16501) [Docs][C++][R] Migrate to Matomo from Google Analytics

2022-05-09 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-16501:


 Summary: [Docs][C++][R] Migrate to Matomo from Google Analytics
 Key: ARROW-16501
 URL: https://issues.apache.org/jira/browse/ARROW-16501
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++, Documentation, R
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou





