[jira] [Created] (ARROW-16161) [C++] Overhead of std::shared_ptr copies is causing thread contention
Weston Pace created ARROW-16161:
---

Summary: [C++] Overhead of std::shared_ptr copies is causing thread contention
Key: ARROW-16161
URL: https://issues.apache.org/jira/browse/ARROW-16161
Project: Apache Arrow
Issue Type: Sub-task
Components: C++
Reporter: Weston Pace

We created a benchmark to measure ExecuteScalarExpression performance in ARROW-16014. We noticed significant thread contention, even though there shouldn't be much, if any, for this task. As part of ARROW-16138 we have been investigating possible causes. One cause appears to be contention from copying shared_ptr objects.

Two possible solutions jump to mind, and I'm sure there are many more.

ExecBatch is an internal type used inside ExecuteScalarExpression as well as inside the execution engine. In the former we can safely assume the data types will exist for the duration of the call. In the latter we can safely assume the data types will exist for the duration of the execution plan. Thus we could take a more targeted fix and migrate only ExecBatch to using DataType* (or const DataType&).

On the other hand, we might consider a more global approach. All of our "stock" data types are assumed to have static storage duration. However, we must use std::shared_ptr because users can create their own extension types. We could invent an "extension type registration" system where extension types must first be registered with the C++ library before being used. Then we could have long-lived DataType instances and could replace std::shared_ptr with DataType* (or const DataType&) throughout most of the code base.

But, as I mentioned, I'm sure there are many approaches to take. CC [~lidavidm], [~apitrou], and [~yibocai] for thoughts, but this might be interesting for just about any C++ dev.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-16160) [C++] IPC Stream Reader doesn't check if extra fields are present for RecordBatches
Micah Kornfield created ARROW-16160:
---

Summary: [C++] IPC Stream Reader doesn't check if extra fields are present for RecordBatches
Key: ARROW-16160
URL: https://issues.apache.org/jira/browse/ARROW-16160
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Affects Versions: 6.0.1
Reporter: Micah Kornfield

I looked through recent commits and I don't think this issue has been patched since:

{code:title=test.python|borderStyle=solid}
import pyarrow as pa

# Example record batches (not in the original report): rb2 carries an
# extra field relative to rb1's schema.
rb1 = pa.RecordBatch.from_pydict({"c1": [1]})
rb2 = pa.RecordBatch.from_pydict({"c1": [1], "c2": [2]})

with pa.output_stream("/tmp/f1") as sink:
    with pa.RecordBatchStreamWriter(sink, rb1.schema) as writer:
        writer.write(rb1)
        end_rb1 = sink.tell()

with pa.output_stream("/tmp/f2") as sink:
    with pa.RecordBatchStreamWriter(sink, rb2.schema) as writer:
        writer.write(rb2)
        start_rb2_only = sink.tell()
        writer.write(rb2)
        end_rb2 = sink.tell()

# Stitch together rb1.schema, rb1, and rb2 without its schema.
with pa.output_stream("/tmp/f3") as sink:
    with pa.input_stream("/tmp/f1") as inp:
        sink.write(inp.read(end_rb1))
    with pa.input_stream("/tmp/f2") as inp:
        inp.seek(start_rb2_only)
        sink.write(inp.read(end_rb2 - start_rb2_only))

with pa.ipc.open_stream("/tmp/f3") as source:
    print(source.read_all())
{code}

Yields:

{code}
pyarrow.Table
c1: int64
----
c1: [[1],[1]]
{code}

I would expect this to error because the second stitched-in record batch has more fields than necessary, but it appears to load just fine. Is this intended behavior?
[jira] [Created] (ARROW-16159) [C++] Allow FileSystem::DeleteDirContents to succeed if the directory is missing
Weston Pace created ARROW-16159:
---

Summary: [C++] Allow FileSystem::DeleteDirContents to succeed if the directory is missing
Key: ARROW-16159
URL: https://issues.apache.org/jira/browse/ARROW-16159
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Weston Pace
Assignee: Weston Pace

Currently DeleteDirContents fails if the directory is missing. This can lead to issues with filesystems that don't support empty directories (see ARROW-12358). Succeeding when the directory is missing is also the behavior desired by the datasets API, so we should be able to ignore missing directories.
[jira] [Created] (ARROW-16158) [C++] rename ARROW_ENGINE to ARROW_SUBSTRAIT
Jonathan Keane created ARROW-16158:
---

Summary: [C++] rename ARROW_ENGINE to ARROW_SUBSTRAIT
Key: ARROW-16158
URL: https://issues.apache.org/jira/browse/ARROW-16158
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Jonathan Keane

When we introduced Substrait we reused the CMake feature {{ARROW_ENGINE}} to mean compute plus a few other things, as well as the Substrait consumer functionality. In general, right now, we don't yet need (or want) to build Substrait in our packages (e.g. the R package) since many places don't yet take advantage of it. But the name of the CMake feature is now confusing: it effectively enables only Substrait (you must separately enable compute, etc.), yet it makes it sound like the query engine we have been building since 6.0.0 is disabled.

We should rename {{ARROW_ENGINE}} to {{ARROW_SUBSTRAIT}} now; we can add an {{ARROW_ENGINE}} back later if we need it to encompass a larger set of engine functionality (e.g. compute + spillover + scheduler + memory limits).
[jira] [Created] (ARROW-16157) [R] Inconsistent behavior for arrow datasets vs working in memory
Egill Axfjord Fridgeirsson created ARROW-16157:
---

Summary: [R] Inconsistent behavior for arrow datasets vs working in memory
Key: ARROW-16157
URL: https://issues.apache.org/jira/browse/ARROW-16157
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 7.0.0
Environment: Ubuntu 21.10, R 4.1.3, Arrow 7.0.0
Reporter: Egill Axfjord Fridgeirsson

When I construct a sparse matrix using indices pulled from an Arrow dataset I get inconsistent behavior: sometimes there are duplicated indices, resulting in a matrix with values greater than one in some places. When the dataset is first loaded into memory, everything works as expected and all the values are one.

Repro:

{code:java}
library(Matrix)
library(dplyr)
library(arrow)

sparseMatrix <- Matrix::rsparsematrix(1e5, 1e3, 0.05, repr = "T")
dF <- data.frame(i = sparseMatrix@i + 1, j = sparseMatrix@j + 1)

arrow::write_dataset(dF, path = './data/feather', format = 'feather')
arrowDataset <- arrow::open_dataset('./data/feather', format = 'feather')

# Run the below a few times; at some point the output is more than just
# 1 for unique(newSparse@x), indicating there are duplicate indices for
# the sparse matrix (the values are then summed there).
newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i),
                                  j = arrowDataset %>% pull(j),
                                  x = 1)
unique(newSparse@x)  # here is the bug; @x is the slot for values

arrowInMemory <- arrowDataset %>% collect()

# After loading in memory the output is never more than 1 no matter how
# often I run it.
newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i),
                                  j = arrowInMemory %>% pull(j),
                                  x = 1)
unique(newSparse@x)
{code}
[jira] [Created] (ARROW-16156) [R] Clarify warning message for features not turned on in .onAttach()
Dewey Dunnington created ARROW-16156:
---

Summary: [R] Clarify warning message for features not turned on in .onAttach()
Key: ARROW-16156
URL: https://issues.apache.org/jira/browse/ARROW-16156
Project: Apache Arrow
Issue Type: Improvement
Reporter: Dewey Dunnington

After ARROW-15818 we get an extra message on package load because most users will not have `-DARROW_ENGINE=ON`. We should add "engine" to the list of capabilities that we don't warn about ( https://github.com/apache/arrow/blob/master/r/R/arrow-package.R#L264-L270 ) and perhaps clarify the message so that it's more obvious why it shows up.

{noformat}
library(arrow)
#> See arrow_info() for available features
{noformat}
[jira] [Created] (ARROW-16155) [R] lubridate functions for 9.0.0
Alessandro Molina created ARROW-16155:
---

Summary: [R] lubridate functions for 9.0.0
Key: ARROW-16155
URL: https://issues.apache.org/jira/browse/ARROW-16155
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 8.0.0
Reporter: Alessandro Molina
Assignee: Dragoș Moldovan-Grünfeld
Fix For: 9.0.0

Umbrella ticket for lubridate functions in 9.0.0.
[jira] [Created] (ARROW-16154) [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing
Nicola Crane created ARROW-16154:
---

Summary: [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing
Key: ARROW-16154
URL: https://issues.apache.org/jira/browse/ARROW-16154
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Nicola Crane
Fix For: 8.0.0

See discussion here for context: https://github.com/apache/arrow/pull/12826#issuecomment-1092052001
[jira] [Created] (ARROW-16153) [JS] Consider implementing a tableFromArray
Dominik Moritz created ARROW-16153:
---

Summary: [JS] Consider implementing a tableFromArray
Key: ARROW-16153
URL: https://issues.apache.org/jira/browse/ARROW-16153
Project: Apache Arrow
Issue Type: Bug
Components: JavaScript
Reporter: Dominik Moritz
Assignee: Dominik Moritz

The idea here is to implement a function that creates a table from an array of objects using the struct builder.
[jira] [Created] (ARROW-16152) [C++] Typo that causes segfault with unknown functions in Substrait
Dewey Dunnington created ARROW-16152:
---

Summary: [C++] Typo that causes segfault with unknown functions in Substrait
Key: ARROW-16152
URL: https://issues.apache.org/jira/browse/ARROW-16152
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Dewey Dunnington

There is a typo in {{ExtensionSet::Make()}} that causes a crash whenever somebody passes an unsupported function to the Substrait consumer. It looks like this was a copy/paste error where {{type_ids}} should be {{function_ids}}:

https://github.com/apache/arrow/blob/a935c81b595d24179e115d64cda944efa93aa0e0/cpp/src/arrow/engine/substrait/extension_set.cc#L167-L168

To reproduce via the R bindings:

{noformat}
arrow:::do_exec_plan_substrait('{
  "extensionUris": [
    {"extensionUriAnchor": 1}
  ],
  "extensions": [
    {
      "extensionFunction": {
        "extensionUriReference": 1,
        "functionAnchor": 2,
        "name": "abs_checked"
      }
    }
  ],
  "relations": [
    {
      "rel": {
        "project": {
          "input": {
            "read": {
              "baseSchema": {
                "names": ["letter", "number"],
                "struct": {
                  "types": [
                    {"string": {}},
                    {"i32": {}}
                  ]
                }
              },
              "namedTable": {"names": ["named_table_1"]}
            }
          },
          "expressions": [
            {
              "scalarFunction": {
                "functionReference": 2,
                "args": [
                  {
                    "selection": {
                      "directReference": {
                        "structField": {"field": 1}
                      }
                    }
                  }
                ],
                "outputType": {}
              }
            }
          ]
        }
      }
    }
  ]
}')
{noformat}
[jira] [Created] (ARROW-16151) [C++][GANDIVA] Add alias varchar to castVarchar functions
Vinicius Souza Roque created ARROW-16151:
---

Summary: [C++][GANDIVA] Add alias varchar to castVarchar functions
Key: ARROW-16151
URL: https://issues.apache.org/jira/browse/ARROW-16151
Project: Apache Arrow
Issue Type: New Feature
Components: C++ - Gandiva
Reporter: Vinicius Souza Roque
[jira] [Created] (ARROW-16150) [C++][GANDIVA] Add alias 'decimal' to castDecimal functions
Vinicius Souza Roque created ARROW-16150:
---

Summary: [C++][GANDIVA] Add alias 'decimal' to castDecimal functions
Key: ARROW-16150
URL: https://issues.apache.org/jira/browse/ARROW-16150
Project: Apache Arrow
Issue Type: New Feature
Components: C++ - Gandiva
Reporter: Vinicius Souza Roque