[jira] [Created] (ARROW-16514) [Website] Update install page for 8.0.0
Kouhei Sutou created ARROW-16514:
---------------------------------

Summary: [Website] Update install page for 8.0.0
Key: ARROW-16514
URL: https://issues.apache.org/jira/browse/ARROW-16514
Project: Apache Arrow
Issue Type: Improvement
Components: Website
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16513) [C++] Add a compute function to hash inputs
Weston Pace created ARROW-16513:
--------------------------------

Summary: [C++] Add a compute function to hash inputs
Key: ARROW-16513
URL: https://issues.apache.org/jira/browse/ARROW-16513
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Weston Pace

We have a lot of internal logic for hashing inputs and it might be nice to expose some of this to users (e.g. https://stackoverflow.com/questions/72177022/how-to-get-hash-of-string-column-in-polars-or-pyarrow).

The `HashBatch` method in `key_hash.h` (not quite merged but close) is likely to be the most performant. However, it does make some sacrifices on uniqueness of hashes in the spirit of performance (so we should make sure to document these).
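Until such a function is exposed, values can be hashed on the user side. The stdlib-only sketch below (no pyarrow dependency; the column is modeled as a plain list, and `hash_column` is a hypothetical helper, not an Arrow API) illustrates the kind of per-value hashing being asked for, including the uniqueness caveat: any fixed-width hash can collide for distinct inputs.

```python
import hashlib

def hash_column(values):
    """Hash each value of a column-like sequence to a 64-bit integer.

    Uses SHA-256 truncated to 8 bytes; as with any truncated hash,
    distinct values can collide, mirroring the caveat noted above.
    """
    out = []
    for v in values:
        digest = hashlib.sha256(str(v).encode("utf-8")).digest()
        out.append(int.from_bytes(digest[:8], "little"))
    return out

hashes = hash_column(["a", "b", "a"])
# Equal inputs hash equally, so hashes can stand in for values in
# group-bys or joins; unequal inputs are only very likely to differ.
assert hashes[0] == hashes[2]
assert hashes[0] != hashes[1]
```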
[jira] [Created] (ARROW-16512) [C++] Support nested custom output field names in Substrait
Weston Pace created ARROW-16512:
--------------------------------

Summary: [C++] Support nested custom output field names in Substrait
Key: ARROW-16512
URL: https://issues.apache.org/jira/browse/ARROW-16512
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Weston Pace

ARROW-15901 added initial support for {{RelRoot::names}}, which assigns names to the output. We still need to add support for struct columns: {{RelRoot::names}} should be a DFS-ordered list of names that includes the names of any nested fields.
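A minimal sketch of the intended consumption order, assuming the flat name list is walked depth-first so each struct column's own name is immediately followed by its children's names. The `Field` class and `assign_names` helper here are hypothetical stand-ins for illustration, not Arrow or Substrait APIs:

```python
from dataclasses import dataclass, field

@dataclass
class Field:
    # Hypothetical stand-in for a schema field: a name slot plus
    # children (non-empty for struct types).
    name: str
    children: list = field(default_factory=list)

def assign_names(fields, names):
    """Assign a DFS-ordered flat name list to a field tree, in place."""
    it = iter(names)
    def visit(f):
        f.name = next(it)
        for child in f.children:
            visit(child)
    for f in fields:
        visit(f)

# One plain column plus one struct column with two nested fields.
schema = [Field(""), Field("", [Field(""), Field("")])]
assign_names(schema, ["a", "s", "s_x", "s_y"])
# DFS order: the struct's own name precedes its nested field names.
assert [schema[1].name, schema[1].children[0].name] == ["s", "s_x"]
```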
[jira] [Created] (ARROW-16511) [R] Preserve schema metadata in write_dataset()
Neal Richardson created ARROW-16511:
------------------------------------

Summary: [R] Preserve schema metadata in write_dataset()
Key: ARROW-16511
URL: https://issues.apache.org/jira/browse/ARROW-16511
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
Fix For: 9.0.0, 8.0.1

When we moved to using ExecPlans instead of Scanner, the metadata from the input table was dropped. We preserved the R metadata but not anything else. It turned out that {{sfarrow}} was relying on extra metadata, and this caused reverse dependency failures in the 8.0.0 release.
[jira] [Created] (ARROW-16510) [R] Add bindings for GCS filesystem
Will Jones created ARROW-16510:
-------------------------------

Summary: [R] Add bindings for GCS filesystem
Key: ARROW-16510
URL: https://issues.apache.org/jira/browse/ARROW-16510
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 8.0.0
Reporter: Will Jones
Fix For: 9.0.0
[jira] [Created] (ARROW-16509) [R][Docs] Update dataset vignette
Will Jones created ARROW-16509:
-------------------------------

Summary: [R][Docs] Update dataset vignette
Key: ARROW-16509
URL: https://issues.apache.org/jira/browse/ARROW-16509
Project: Apache Arrow
Issue Type: Improvement
Components: Documentation, R
Affects Versions: 8.0.0
Reporter: Will Jones
Fix For: 9.0.0

Since the dataset vignette was written, we've added join, aggregation, and distinct support (and soon union/union_all support). The dataset vignette currently says we don't support those operations.
[jira] [Created] (ARROW-16508) [Archery][DevTools] Allow specific success or failure message to be sent on chat report
Raúl Cumplido created ARROW-16508:
----------------------------------

Summary: [Archery][DevTools] Allow specific success or failure message to be sent on chat report
Key: ARROW-16508
URL: https://issues.apache.org/jira/browse/ARROW-16508
Project: Apache Arrow
Issue Type: Sub-task
Reporter: Raúl Cumplido
Assignee: Raúl Cumplido
Fix For: 9.0.0

This feature was requested so that the chat report message can be extended based on the success or failure of jobs.
[jira] [Created] (ARROW-16507) [CI][C++] Use system gtest with numba/conda
Jacob Wujciak-Jens created ARROW-16507:
---------------------------------------

Summary: [CI][C++] Use system gtest with numba/conda
Key: ARROW-16507
URL: https://issues.apache.org/jira/browse/ARROW-16507
Project: Apache Arrow
Issue Type: Bug
Components: C++, Continuous Integration
Reporter: Jacob Wujciak-Jens
Assignee: Jacob Wujciak-Jens
Fix For: 9.0.0

With the change in ARROW-1490, removing gtest is no longer needed and breaks the build.
[jira] [Created] (ARROW-16506) Pyarrow 8.0.0 write_dataset writes data in different order with {{use_threads=True}}
Daniel Friar created ARROW-16506:
---------------------------------

Summary: Pyarrow 8.0.0 write_dataset writes data in different order with {{use_threads=True}}
Key: ARROW-16506
URL: https://issues.apache.org/jira/browse/ARROW-16506
Project: Apache Arrow
Issue Type: Bug
Reporter: Daniel Friar

In the latest (8.0.0) release, the following code snippet seems to write out data in a different order for each of the partitions when {{use_threads=True}} vs when {{use_threads=False}}. Testing the same snippet with pyarrow 7.0.0 gives the same order regardless of whether {{use_threads}} is set to True when the data is written.

{code:python}
import itertools

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

n_rows, n_cols = 100_000, 20

def create_dataframe(color, year):
    arr = np.random.randn(n_rows, n_cols)
    df = pd.DataFrame(data=arr, columns=[f"column_{i}" for i in range(n_cols)])
    df["color"] = color
    df["year"] = year
    df["id"] = np.arange(len(df))
    return df

partitions = ["red", "green", "blue"]
years = [2011, 2012, 2013]

dataframes = [create_dataframe(p, y) for p, y in itertools.product(partitions, years)]
df = pd.concat(dataframes)
table = pa.Table.from_pandas(df=df)

ds.write_dataset(
    table,
    "./test",
    format="parquet",
    max_rows_per_group=1_000_000,
    min_rows_per_group=1_000_000,
    existing_data_behavior="overwrite_or_ignore",
    partitioning=ds.partitioning(pa.schema([
        ("color", pa.string()),
        ("year", pa.int64())
    ]), flavor="hive"),
    use_threads=True,
)

df_read = pd.read_parquet("./test/color=blue/year=2012")
df_read.head()[["id"]]
{code}

Tested on Ubuntu 20.04 with Python 3.8 and arrow versions 8.0.0 and 7.0.0.
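A stdlib-only sketch of the underlying effect, under the assumption that threaded writers interleave rows nondeterministically while keeping every row: when a monotonically increasing id column exists (as in the snippet above), a reader can restore the original order by sorting on it. The row layout here is invented for illustration:

```python
import random
import threading

# Source data in a known order, with an explicit "id" column.
source_rows = [{"id": i, "value": i * 10} for i in range(100)]
written = []
lock = threading.Lock()

def write_chunk(chunk):
    # Simulate a writer thread appending its rows at unpredictable times.
    for row in chunk:
        with lock:
            written.append(row)

threads = [threading.Thread(target=write_chunk, args=(source_rows[i::4],))
           for i in range(4)]
random.shuffle(threads)
for t in threads:
    t.start()
for t in threads:
    t.join()

# All rows are present, but the interleaved order may differ from the
# source; a stable sort on "id" recovers the original ordering.
restored = sorted(written, key=lambda r: r["id"])
assert restored == source_rows
```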
[jira] [Created] (ARROW-16505) [Python][Parquet] Enable usage of external key material and rotation for encryption keys in PyArrow
Maya Anderson created ARROW-16505:
----------------------------------

Summary: [Python][Parquet] Enable usage of external key material and rotation for encryption keys in PyArrow
Key: ARROW-16505
URL: https://issues.apache.org/jira/browse/ARROW-16505
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Maya Anderson

Python API wrapper for ARROW-9960.
[jira] [Created] (ARROW-16504) [Go][CSV] Add arrow.TimestampType support to the reader
Mark Wolfe created ARROW-16504:
-------------------------------

Summary: [Go][CSV] Add arrow.TimestampType support to the reader
Key: ARROW-16504
URL: https://issues.apache.org/jira/browse/ARROW-16504
Project: Apache Arrow
Issue Type: Improvement
Components: Go
Affects Versions: 8.0.0
Reporter: Mark Wolfe

There is already a helper to convert strings to arrow.Timestamp, so incorporate it into the CSV reader.
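A sketch of the desired reader behavior, in stdlib Python rather than Go for brevity: timestamp strings in a designated CSV column are converted to typed timestamp values as rows are read. The column names and helper here are made up for illustration and are not Arrow APIs:

```python
import csv
import io
from datetime import datetime

raw = "ts,value\n2022-05-09T10:00:00,1\n2022-05-09T11:30:00,2\n"

def read_csv_with_timestamps(text, ts_columns=("ts",)):
    """Read CSV text, converting the named columns to datetime values."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for col in ts_columns:
            # Parse ISO-8601 timestamp strings into typed values
            # instead of leaving them as plain strings.
            row[col] = datetime.fromisoformat(row[col])
        rows.append(row)
    return rows

rows = read_csv_with_timestamps(raw)
assert rows[0]["ts"].hour == 10
```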
[jira] [Created] (ARROW-16503) [C++] Can't concatenate extension arrays
Dewey Dunnington created ARROW-16503:
-------------------------------------

Summary: [C++] Can't concatenate extension arrays
Key: ARROW-16503
URL: https://issues.apache.org/jira/browse/ARROW-16503
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Dewey Dunnington

It looks like Arrays with an extension type can't be concatenated. From the R bindings:

{code:R}
library(arrow, warn.conflicts = FALSE)

arr <- vctrs_extension_array(1:10)
concat_arrays(arr, arr)
#> Error: NotImplemented: concatenation of integer(0)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195  VisitTypeInline(*out_->type, this)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590  ConcatenateImpl(data, pool).Concatenate(_data)
{code}

This shows up more practically when using the query engine:

{code:R}
library(arrow, warn.conflicts = FALSE)

table <- arrow_table(
  group = rep(c("a", "b"), 5),
  col1 = 1:10,
  col2 = vctrs_extension_array(1:10)
)

tf <- tempfile()
table |>
  dplyr::group_by(group) |>
  write_dataset(tf)

open_dataset(tf) |>
  dplyr::arrange(col1) |>
  dplyr::collect()
#> Error in `dplyr::collect()`:
#> ! NotImplemented: concatenation of extension
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195  VisitTypeInline(*out_->type, this)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590  ConcatenateImpl(data, pool).Concatenate(_data)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2025  Concatenate(values.chunks(), ctx->memory_pool())
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2084  TakeCA(*table.column(j), indices, options, ctx)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/sink_node.cc:527  impl_->DoFinish()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:467  iterator_.Next()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337  ReadNext()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351  ToRecordBatches()
{code}
[jira] [Created] (ARROW-16502) StructBuilder UnmarshalJSON does not handle missing optional fields
Przemysław Kowolik created ARROW-16502:
---------------------------------------

Summary: StructBuilder UnmarshalJSON does not handle missing optional fields
Key: ARROW-16502
URL: https://issues.apache.org/jira/browse/ARROW-16502
Project: Apache Arrow
Issue Type: Bug
Components: Go
Affects Versions: 8.0.0
Reporter: Przemysław Kowolik

When calling array.StructBuilder.UnmarshalJSON with a JSON object that has missing optional fields, it fails to decode the JSON object properly and will panic. However, it is common behavior for producers to drop empty/null fields from the JSON.
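A sketch of the expected lenient behavior, in stdlib Python rather than Go for brevity: when decoding a JSON object against a fixed set of struct fields, absent keys are treated as null instead of causing a failure. The field names here are invented for illustration:

```python
import json

# Hypothetical struct schema: every field is optional.
STRUCT_FIELDS = ("name", "age", "email")

def decode_struct(text):
    """Decode a JSON object, defaulting missing optional fields to None."""
    obj = json.loads(text)
    # dict.get returns None for absent keys, so a dropped field becomes
    # a null value rather than a decoding error.
    return {f: obj.get(f) for f in STRUCT_FIELDS}

row = decode_struct('{"name": "ada"}')  # "age" and "email" omitted
assert row == {"name": "ada", "age": None, "email": None}
```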
[jira] [Created] (ARROW-16501) [Docs][C++][R] Migrate to Matomo from Google Analytics
Kouhei Sutou created ARROW-16501:
---------------------------------

Summary: [Docs][C++][R] Migrate to Matomo from Google Analytics
Key: ARROW-16501
URL: https://issues.apache.org/jira/browse/ARROW-16501
Project: Apache Arrow
Issue Type: Sub-task
Components: C++, Documentation, R
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou