[jira] [Created] (ARROW-16161) [C++] Overhead of std::shared_ptr copies is causing thread contention

2022-04-08 Thread Weston Pace (Jira)
Weston Pace created ARROW-16161:
---

 Summary: [C++] Overhead of std::shared_ptr copies is 
causing thread contention
 Key: ARROW-16161
 URL: https://issues.apache.org/jira/browse/ARROW-16161
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Weston Pace


We created a benchmark to measure ExecuteScalarExpression performance in 
ARROW-16014.  We noticed significant thread contention (even though there 
shouldn't be much, if any, for this task).  As part of ARROW-16138 we have been 
investigating possible causes.

One cause seems to be contention from copying shared_ptr objects.

Two possible solutions jump to mind and I'm sure there are many more.

ExecBatch is an internal type and used inside of ExecuteScalarExpression as 
well as inside of the execution engine.  In the former we can safely assume the 
data types will exist for the duration of the call.  In the latter we can 
safely assume the data types will exist for the duration of the execution plan.  
Thus we can probably apply a more targeted fix and migrate only ExecBatch to 
using DataType* (or const DataType&).
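
As a rough illustration of the difference between the two styles (a minimal sketch, not Arrow code: DataType, OwningDescr and BorrowedDescr below are hypothetical stand-ins), copying a shared_ptr touches the shared reference count, while copying a borrowed pointer does not:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for arrow::DataType; not the real class.
struct DataType {
  std::string name;
};

// Owning style: every copy of the shared_ptr bumps the shared reference
// count, which is the state that threads using the same type contend on.
struct OwningDescr {
  std::shared_ptr<DataType> type;
};

// Borrowing style: copying a raw pointer touches no shared state. The caller
// must guarantee that the DataType outlives the descriptor.
struct BorrowedDescr {
  const DataType* type;
};

// Returns the use_count observed while one extra owning copy is alive.
long UseCountWithOwningCopy(const std::shared_ptr<DataType>& type) {
  OwningDescr copy{type};  // increment here, decrement when copy is destroyed
  return copy.type.use_count();
}

// Returns the use_count observed while one borrowed copy is alive.
long UseCountWithBorrowedCopy(const std::shared_ptr<DataType>& type) {
  BorrowedDescr copy{type.get()};  // plain pointer copy, no refcount traffic
  assert(copy.type != nullptr);
  return type.use_count();
}
```

The borrowed form is only safe under the lifetime assumptions described above (the types outlive the call or the execution plan).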

On the other hand, we might consider a more global approach.  All of our 
"stock" data types are assumed to have static storage duration.  However, we 
must use std::shared_ptr<DataType> because users could create their own 
extension types.  We could invent an "extension type registration" system where 
extension types must first be registered with the C++ lib before being used.  
Then we could have long-lived DataType instances and we could replace 
std::shared_ptr<DataType> with DataType* (or const DataType&) throughout most 
of the code base.
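
One possible shape for such a registration system, purely as a sketch (TypeRegistry, Register and Lookup are hypothetical names, and DataType is a stand-in rather than the real class): the registry owns each registered type at a stable address, so the rest of the code can hold plain pointers.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for arrow::DataType; not the real class.
struct DataType {
  std::string name;
};

// Hypothetical registry: it owns every registered type for as long as the
// registry lives, so callers can keep plain const DataType* afterwards.
// Not thread-safe: this sketch assumes registration happens before any
// worker threads start.
class TypeRegistry {
 public:
  // Registers a type under its name and returns a long-lived pointer to it.
  // Registering the same name twice returns the existing instance.
  const DataType* Register(DataType type) {
    assert(!type.name.empty());
    auto it = types_.find(type.name);
    if (it == types_.end()) {
      std::string key = type.name;  // copy the key before moving the type
      it = types_
               .emplace(std::move(key),
                        std::make_unique<DataType>(std::move(type)))
               .first;
    }
    return it->second.get();
  }

  // Returns the registered type, or nullptr if the name is unknown.
  const DataType* Lookup(const std::string& name) const {
    auto it = types_.find(name);
    return it == types_.end() ? nullptr : it->second.get();
  }

 private:
  // unique_ptr keeps each DataType at a stable address even if the map
  // rehashes, so previously returned pointers stay valid.
  std::unordered_map<std::string, std::unique_ptr<DataType>> types_;
};
```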

But, as I mentioned, I'm sure there are many approaches to take.  CC 
[~lidavidm] and [~apitrou] and [~yibocai] for thoughts but this might be 
interesting for just about any C++ dev.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16160) [C++] IPC Stream Reader doesn't check if extra fields are present for RecordBatches

2022-04-08 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-16160:
---

 Summary: [C++] IPC Stream Reader doesn't check if extra fields are 
present for RecordBatches
 Key: ARROW-16160
 URL: https://issues.apache.org/jira/browse/ARROW-16160
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Affects Versions: 6.0.1
Reporter: Micah Kornfield


I looked through recent commits and I don't think this issue has been patched 
since:

{code:title=test.python|borderStyle=solid}
import pyarrow as pa

# rb1 and rb2 are pyarrow RecordBatches created beforehand (not shown here);
# rb2 has more fields than rb1.

# Write rb1 and record the offset just after it (before the end-of-stream marker).
with pa.output_stream("/tmp/f1") as sink:
  with pa.RecordBatchStreamWriter(sink, rb1.schema) as writer:
    writer.write(rb1)
    end_rb1 = sink.tell()

# Write rb2 twice so the second copy can be extracted without its schema.
with pa.output_stream("/tmp/f2") as sink:
  with pa.RecordBatchStreamWriter(sink, rb2.schema) as writer:
    writer.write(rb2)
    start_rb2_only = sink.tell()
    writer.write(rb2)
    end_rb2 = sink.tell()

# Stitch together rb1.schema, rb1 and rb2 without schema.
with pa.output_stream("/tmp/f3") as sink:
  with pa.input_stream("/tmp/f1") as inp:
    sink.write(inp.read(end_rb1))
  with pa.input_stream("/tmp/f2") as inp:
    inp.seek(start_rb2_only)
    sink.write(inp.read(end_rb2 - start_rb2_only))

with pa.ipc.open_stream("/tmp/f3") as sink:
  print(sink.read_all())
{code}
Yields:
{code}
pyarrow.Table
c1: int64

c1: [[1],[1]]
{code}

I would expect this to error because the second, stitched-in record batch has 
more fields than necessary, but it appears to load just fine.  

Is this intended behavior?





[jira] [Created] (ARROW-16159) [C++] Allow FileSystem::DeleteDirContents to succeed if the directory is missing

2022-04-08 Thread Weston Pace (Jira)
Weston Pace created ARROW-16159:
---

 Summary: [C++] Allow FileSystem::DeleteDirContents to succeed if 
the directory is missing
 Key: ARROW-16159
 URL: https://issues.apache.org/jira/browse/ARROW-16159
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace
Assignee: Weston Pace


Currently DeleteDirContents fails if the directory is missing.  This can lead 
to issues with filesystems that don't support empty directories (see 
ARROW-12358).  Succeeding when the directory is missing is also the behavior 
the datasets API wants.  We should be able to ignore missing directories.
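
A minimal sketch of the desired semantics, written against std::filesystem rather than Arrow's FileSystem classes (the free function and its missing_dir_ok flag are hypothetical names used only for illustration):

```cpp
#include <cassert>
#include <filesystem>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper, not Arrow's FileSystem API: removes everything inside
// `dir` but keeps `dir` itself. Returns false on error.
// With missing_dir_ok = true, a missing directory counts as success.
bool DeleteDirContents(const fs::path& dir, bool missing_dir_ok) {
  std::error_code ec;
  if (!fs::exists(dir, ec)) {
    // Either the path is absent (ec is clear) or its status is unknown.
    return !ec && missing_dir_ok;
  }
  if (!fs::is_directory(dir, ec)) return false;

  // Collect the entries first so nothing is removed while iterating.
  std::vector<fs::path> entries;
  for (fs::directory_iterator it(dir, ec), end; !ec && it != end;
       it.increment(ec)) {
    entries.push_back(it->path());
  }
  if (ec) return false;

  for (const auto& entry : entries) {
    fs::remove_all(entry, ec);  // removes files and whole subdirectories
    if (ec) return false;
  }
  return true;
}
```

With missing_dir_ok set to false the sketch keeps the current behavior and reports a missing directory as a failure.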





[jira] [Created] (ARROW-16158) [C++] rename ARROW_ENGINE to ARROW_SUBSTRAIT

2022-04-08 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-16158:
--

 Summary: [C++] rename ARROW_ENGINE to ARROW_SUBSTRAIT
 Key: ARROW-16158
 URL: https://issues.apache.org/jira/browse/ARROW-16158
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Jonathan Keane


When we introduced substrait we reused the cmake option and feature 
{{ARROW_ENGINE}} to mean compute plus a few other things as well as the 
substrait consumer functionality. In general, right now, we don't yet need (or 
want) to build substrait in our packages (e.g. the R package) since many places 
don't yet take advantage of it. But the name of the cmake option and feature is 
now confusing: it effectively covers only substrait if you separately enable 
compute, etc., but it makes it sound like the query engine we have been 
building since 6.0.0 is disabled.

We should rename {{ARROW_ENGINE}} to {{ARROW_SUBSTRAIT}} now and then we can 
add an {{ARROW_ENGINE}} later if we need to encompass a larger set of engine 
functionality (e.g. compute+spillover+scheduler+memory limits).





[jira] [Created] (ARROW-16157) [R] Inconsistent behavior for arrow datasets vs working in memory

2022-04-08 Thread Egill Axfjord Fridgeirsson (Jira)
Egill Axfjord Fridgeirsson created ARROW-16157:
--

 Summary: [R] Inconsistent behavior for arrow datasets vs working 
in memory
 Key: ARROW-16157
 URL: https://issues.apache.org/jira/browse/ARROW-16157
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 7.0.0
 Environment: Ubuntu 21.10
R 4.1.3.
Arrow 7.0.0
Reporter: Egill Axfjord Fridgeirsson


When I generate a sparse matrix using indices from an arrow dataset I get 
inconsistent behavior: sometimes there are duplicated indices, resulting in a 
matrix with values greater than one in some places. When I load the dataset 
into memory first, everything works as expected and all the values are one.

Repro
{code:java}
library(Matrix)
library(dplyr)
library(arrow)

sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")

dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)

arrow::write_dataset(dF, path='./data/feather', format='feather')
arrowDataset <- arrow::open_dataset('./data/feather', format='feather')

# run the below a few times, and at some point the output is more than just 1
# for unique(newSparse@x), indicating there are duplicate indices for
# the sparse matrix (then it adds the values there)
newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i) ,
                                  j = arrowDataset %>% pull(j),
                                  x = 1)
unique(newSparse@x) # here is the bug, @x is the slot for values


arrowInMemory <- arrowDataset %>% collect()

# after loading in memory the output is never more than 1 no matter how 
# often I run it
newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i) ,
                                  j = arrowInMemory %>% pull(j),
                                  x = 1)
unique(newSparse@x)
{code}





[jira] [Created] (ARROW-16156) [R] Clarify warning message for features not turned on in .onAttach()

2022-04-08 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-16156:


 Summary: [R] Clarify warning message for features not turned on in 
.onAttach()
 Key: ARROW-16156
 URL: https://issues.apache.org/jira/browse/ARROW-16156
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Dewey Dunnington


After ARROW-15818 we get an extra message on package load because most users 
will not have `-DARROW_ENGINE=ON`. We should add "engine" to the list of 
capabilities that we don't warn about ( 
https://github.com/apache/arrow/blob/master/r/R/arrow-package.R#L264-L270 ) and 
perhaps clarify the message so that it's more obvious why it shows up.

{noformat}
library(arrow)
#> See arrow_info() for available features
{noformat}





[jira] [Created] (ARROW-16155) [R] lubridate functions for 9.0.0

2022-04-08 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-16155:
-

 Summary: [R] lubridate functions for 9.0.0
 Key: ARROW-16155
 URL: https://issues.apache.org/jira/browse/ARROW-16155
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 8.0.0
Reporter: Alessandro Molina
Assignee: Dragoș Moldovan-Grünfeld
 Fix For: 9.0.0


Umbrella ticket for lubridate functions in 9.0.0





[jira] [Created] (ARROW-16154) [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing

2022-04-08 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16154:


 Summary: [R] Errors which pass through `handle_csv_read_error()` 
and `handle_parquet_io_error()` need better error tracing
 Key: ARROW-16154
 URL: https://issues.apache.org/jira/browse/ARROW-16154
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane
 Fix For: 8.0.0


See discussion here for context: 
https://github.com/apache/arrow/pull/12826#issuecomment-1092052001





[jira] [Created] (ARROW-16153) [JS] Consider implementing a tableFromArray

2022-04-08 Thread Dominik Moritz (Jira)
Dominik Moritz created ARROW-16153:
--

 Summary: [JS] Consider implementing a tableFromArray
 Key: ARROW-16153
 URL: https://issues.apache.org/jira/browse/ARROW-16153
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Dominik Moritz
Assignee: Dominik Moritz


The idea here is to implement a function that creates a table from an array of 
objects using the struct builder. 





[jira] [Created] (ARROW-16152) [C++] Typo that causes segfault with unknown functions in Substrait

2022-04-08 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-16152:


 Summary: [C++] Typo that causes segfault with unknown functions in 
Substrait
 Key: ARROW-16152
 URL: https://issues.apache.org/jira/browse/ARROW-16152
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Dewey Dunnington


There is a typo in {{ExtensionSet::Make()}} that causes a crash whenever 
somebody passes an unsupported function to the Substrait consumer. It looks 
like a copy/paste error, where {{type_ids}} should be {{function_ids}}:

https://github.com/apache/arrow/blob/a935c81b595d24179e115d64cda944efa93aa0e0/cpp/src/arrow/engine/substrait/extension_set.cc#L167-L168

To reproduce via the R bindings:

{noformat}
arrow:::do_exec_plan_substrait('
{
  "extensionUris": [
    {
      "extensionUriAnchor": 1
    }
  ],
  "extensions": [
    {
      "extensionFunction": {
        "extensionUriReference": 1,
        "functionAnchor": 2,
        "name": "abs_checked"
      }
    }
  ],
  "relations": [
    {
      "rel": {
        "project": {
          "input": {
            "read": {
              "baseSchema": {
                "names": [
                  "letter",
                  "number"
                ],
                "struct": {
                  "types": [
                    {
                      "string": {}
                    },
                    {
                      "i32": {}
                    }
                  ]
                }
              },
              "namedTable": {
                "names": [
                  "named_table_1"
                ]
              }
            }
          },
          "expressions": [
            {
              "scalarFunction": {
                "functionReference": 2,
                "args": [
                  {
                    "selection": {
                      "directReference": {
                        "structField": {
                          "field": 1
                        }
                      }
                    }
                  }
                ],
                "outputType": {}
              }
            }
          ]
        }
      }
    }
  ]
}
')
{noformat}








[jira] [Created] (ARROW-16151) [C++][GANDIVA] Add alias varchar to castVarchar functions

2022-04-08 Thread Vinicius Souza Roque (Jira)
Vinicius Souza Roque created ARROW-16151:


 Summary: [C++][GANDIVA] Add alias varchar to castVarchar functions
 Key: ARROW-16151
 URL: https://issues.apache.org/jira/browse/ARROW-16151
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: Vinicius Souza Roque








[jira] [Created] (ARROW-16150) [C++][GANDIVA] Add alias 'decimal' to castDecimal functions

2022-04-08 Thread Vinicius Souza Roque (Jira)
Vinicius Souza Roque created ARROW-16150:


 Summary: [C++][GANDIVA] Add alias 'decimal' to castDecimal 
functions
 Key: ARROW-16150
 URL: https://issues.apache.org/jira/browse/ARROW-16150
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: Vinicius Souza Roque





