[jira] [Commented] (ARROW-18371) [C++] Expose *FromJSON helpers

2022-12-15 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17647959#comment-17647959
 ] 

Yaron Gvili commented on ARROW-18371:
-

The discussion went beyond the current subject of `*FromJSON` - let's update 
the subject or split the discussion.

I agree that `MakeBasicBatches` is not needed. While I'm less sure about 
whether `BatchesWithSchema` is indeed not needed, I'll take your opinion. 
Regarding random data generation, I'd prioritize having the functions; the node 
could come later. Regarding the assertion macros, they are specific to `gtest` 
and I'm wondering whether we could make them flexible and less opinionated - 
that would be a separate issue. Note that I have my own subset of these macros 
specific to `Catch2`, so this flexibility can be obtained.

> [C++] Expose *FromJSON helpers
> --
>
> Key: ARROW-18371
> URL: https://issues.apache.org/jira/browse/ARROW-18371
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Rok Mihevc
>Assignee: Bryce Mecum
>Priority: Major
>  Labels: testing
>
> {Array,{Exec,Record}Batch}FromJSON helper functions would be useful when 
> testing in projects that use Arrow. BatchesWithSchema and MakeBasicBatches 
> could be considered as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18427) [C++] Support negative tolerance in `AsofJoinNode`

2022-12-07 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-18427:

Description: Currently, `AsofJoinNode` supports a tolerance that is 
non-negative, allowing past-joining, i.e., joining right-table rows with a 
timestamp at or before that of the left-table row. This issue will add support 
for a negative tolerance, which would allow future-joining too.  (was: 
Currently, `AsofJoinNode` supports a tolerance that is non-negative, allowing 
past-joining, i.e., joining right-table rows with a timestamp at or before that 
of the left-table row. This issue will add support for a positive tolerance, 
which would allow future-joining too.)

> [C++] Support negative tolerance in `AsofJoinNode`
> --
>
> Key: ARROW-18427
> URL: https://issues.apache.org/jira/browse/ARROW-18427
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> Currently, `AsofJoinNode` supports a tolerance that is non-negative, allowing 
> past-joining, i.e., joining right-table rows with a timestamp at or before 
> that of the left-table row. This issue will add support for a negative 
> tolerance, which would allow future-joining too.
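
To make the past/future distinction concrete, here is a minimal sketch of how a signed tolerance could select a right-table row. `AsofMatch` is a hypothetical helper, not part of the Arrow API, and the exact window semantics (inclusive bounds, latest match for past-joining, earliest match for future-joining) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the Arrow API. Returns the index of the
// right-table row matched to `left_ts`, or -1 if none is within tolerance.
// `right_ts` must be sorted in ascending order.
int64_t AsofMatch(const std::vector<int64_t>& right_ts, int64_t left_ts,
                  int64_t tolerance) {
  assert(std::is_sorted(right_ts.begin(), right_ts.end()));
  int64_t best = -1;
  for (std::size_t i = 0; i < right_ts.size(); ++i) {
    const int64_t ts = right_ts[i];
    if (tolerance >= 0) {
      // Past join: keep the latest row with
      // left_ts - tolerance <= ts <= left_ts.
      if (ts > left_ts) break;
      if (ts >= left_ts - tolerance) best = static_cast<int64_t>(i);
    } else {
      // Future join: take the earliest row with
      // left_ts <= ts <= left_ts - tolerance (tolerance is negative).
      if (ts > left_ts - tolerance) break;
      if (ts >= left_ts) return static_cast<int64_t>(i);
    }
  }
  return best;
}
```

With right-table timestamps 10, 20, 30 and a left timestamp of 25, a tolerance of 10 matches the row at 20, while a tolerance of -10 matches the row at 30.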





[jira] [Updated] (ARROW-18427) [C++] Support negative tolerance in `AsofJoinNode`

2022-12-07 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-18427:

Summary: [C++] Support negative tolerance in `AsofJoinNode`  (was: [C++] 
Support negative toletance in `AsofJoinNode`)

> [C++] Support negative tolerance in `AsofJoinNode`
> --
>
> Key: ARROW-18427
> URL: https://issues.apache.org/jira/browse/ARROW-18427
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> Currently, `AsofJoinNode` supports a tolerance that is non-negative, allowing 
> past-joining, i.e., joining right-table rows with a timestamp at or before 
> that of the left-table row. This issue will add support for a positive 
> tolerance, which would allow future-joining too.





[jira] [Updated] (ARROW-18427) [C++] Support negative toletance in `AsofJoinNode`

2022-12-07 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-18427:

Summary: [C++] Support negative toletance in `AsofJoinNode`  (was: [C++] 
Suppose negative toletance in `AsofJoinNode`)

> [C++] Support negative toletance in `AsofJoinNode`
> --
>
> Key: ARROW-18427
> URL: https://issues.apache.org/jira/browse/ARROW-18427
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> Currently, `AsofJoinNode` supports a tolerance that is non-negative, allowing 
> past-joining, i.e., joining right-table rows with a timestamp at or before 
> that of the left-table row. This issue will add support for a positive 
> tolerance, which would allow future-joining too.





[jira] [Created] (ARROW-18427) [C++] Suppose negative toletance in `AsofJoinNode`

2022-12-07 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18427:
---

 Summary: [C++] Suppose negative toletance in `AsofJoinNode`
 Key: ARROW-18427
 URL: https://issues.apache.org/jira/browse/ARROW-18427
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili


Currently, `AsofJoinNode` supports a tolerance that is non-negative, allowing 
past-joining, i.e., joining right-table rows with a timestamp at or before that 
of the left-table row. This issue will add support for a positive tolerance, 
which would allow future-joining too.





[jira] [Created] (ARROW-18424) [C++] Fix Doxygen error on `arrow::engine::ConversionStrictness`

2022-12-05 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18424:
---

 Summary: [C++] Fix Doxygen error on 
`arrow::engine::ConversionStrictness`
 Key: ARROW-18424
 URL: https://issues.apache.org/jira/browse/ARROW-18424
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili


Doxygen is hitting the following error: 
`/arrow/cpp/src/arrow/engine/substrait/options.h:37: error: documented symbol 
'enum ARROW_ENGINE_EXPORT arrow::engine::arrow::engine::ConversionStrictness' 
was not declared or defined. (warning treated as error, aborting now)`. See 
[this CI job 
output|https://github.com/apache/arrow/actions/runs/3557712768/jobs/5975904381],
 for example.





[jira] [Commented] (ARROW-18417) [C++] Support emit info in Substrait extension-multi and AsOfJoin

2022-11-29 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17640748#comment-17640748
 ] 

Yaron Gvili commented on ARROW-18417:
-

cc [~westonpace]

> [C++] Support emit info in Substrait extension-multi and AsOfJoin
> -
>
> Key: ARROW-18417
> URL: https://issues.apache.org/jira/browse/ARROW-18417
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> Currently, Arrow-Substrait does not handle emit info that may appear in an 
> extension-multi in a Substrait plan. Besides the generic handling in the 
> Arrow-Substrait extension API, specific handling for AsOfJoin is required, 
> because AsOfJoinNode produces an output schema that is different from the one 
> used in the emit info. In particular, the AsOfJoinNode output schema does not 
> include on- and by-keys of right tables.





[jira] [Created] (ARROW-18417) [C++] Support emit info in Substrait extension-multi and AsOfJoin

2022-11-29 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18417:
---

 Summary: [C++] Support emit info in Substrait extension-multi and 
AsOfJoin
 Key: ARROW-18417
 URL: https://issues.apache.org/jira/browse/ARROW-18417
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili


Currently, Arrow-Substrait does not handle emit info that may appear in an 
extension-multi in a Substrait plan. Besides the generic handling in the 
Arrow-Substrait extension API, specific handling for AsOfJoin is required, 
because AsOfJoinNode produces an output schema that is different from the one 
used in the emit info. In particular, the AsOfJoinNode output schema does not 
include on- and by-keys of right tables.





[jira] [Commented] (ARROW-18402) [C++] Expose `DeclarationInfo`

2022-11-25 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17638633#comment-17638633
 ] 

Yaron Gvili commented on ARROW-18402:
-

cc [~westonpace]

> [C++] Expose `DeclarationInfo`
> --
>
> Key: ARROW-18402
> URL: https://issues.apache.org/jira/browse/ARROW-18402
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> `DeclarationInfo` is just a pair of `Declaration` and `Schema`, which are 
> public APIs, and so can be made public API itself. This can be part of or a 
> follow-up on [https://github.com/apache/arrow/pull/14485], and will allow 
> implementing extension providers, whose API depends on `DeclarationInfo`, 
> outside of the Arrow repo.





[jira] [Updated] (ARROW-18402) [C++] Expose `DeclarationInfo`

2022-11-24 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-18402:

Description: `DeclarationInfo` is just a pair of `Declaration` and 
`Schema`, which are public APIs, and so can be made public API itself. This can 
be part of or a follow-up on [https://github.com/apache/arrow/pull/14485], and 
will allow implementing extension providers, whose API depends on 
`DeclarationInfo`, outside of the Arrow repo.

> [C++] Expose `DeclarationInfo`
> --
>
> Key: ARROW-18402
> URL: https://issues.apache.org/jira/browse/ARROW-18402
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> `DeclarationInfo` is just a pair of `Declaration` and `Schema`, which are 
> public APIs, and so can be made public API itself. This can be part of or a 
> follow-up on [https://github.com/apache/arrow/pull/14485], and will allow 
> implementing extension providers, whose API depends on `DeclarationInfo`, 
> outside of the Arrow repo.





[jira] [Created] (ARROW-18402) [C++] Expose `DeclarationInfo`

2022-11-24 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18402:
---

 Summary: [C++] Expose `DeclarationInfo`
 Key: ARROW-18402
 URL: https://issues.apache.org/jira/browse/ARROW-18402
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili








[jira] [Updated] (ARROW-18369) [C++] Support nested references as segment ids

2022-11-20 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-18369:

Description: This is a [follow-up 
task|https://github.com/apache/arrow/pull/14352#discussion_r1026945315] for a 
PR.

> [C++] Support nested references as segment ids
> --
>
> Key: ARROW-18369
> URL: https://issues.apache.org/jira/browse/ARROW-18369
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Priority: Major
>
> This is a [follow-up 
> task|https://github.com/apache/arrow/pull/14352#discussion_r1026945315] for a 
> PR.





[jira] [Created] (ARROW-18369) [C++] Support nested references as segment ids

2022-11-20 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18369:
---

 Summary: [C++] Support nested references as segment ids
 Key: ARROW-18369
 URL: https://issues.apache.org/jira/browse/ARROW-18369
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Yaron Gvili








[jira] [Created] (ARROW-18368) [Python] Expose grouping segment keys to PyArrow

2022-11-20 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18368:
---

 Summary: [Python] Expose grouping segment keys to PyArrow
 Key: ARROW-18368
 URL: https://issues.apache.org/jira/browse/ARROW-18368
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Yaron Gvili


This is a [follow-up 
task|https://github.com/apache/arrow/pull/14352#discussion_r1026926422] for a 
PR.





[jira] [Assigned] (ARROW-18342) [C++] AsofJoinNode support for Boolean data field

2022-11-16 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili reassigned ARROW-18342:
---

Assignee: Yaron Gvili

> [C++] AsofJoinNode support for Boolean data field
> -
>
> Key: ARROW-18342
> URL: https://issues.apache.org/jira/browse/ARROW-18342
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Rok Mihevc
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: kernel
>
> This is to add boolean data field support to asof join as proposed here: 
> https://github.com/westonpace/arrow/pull/24





[jira] [Assigned] (ARROW-18310) [C++] Use atomic backpressure counter

2022-11-11 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili reassigned ARROW-18310:
---

Assignee: Yaron Gvili

> [C++] Use atomic backpressure counter
> -
>
> Key: ARROW-18310
> URL: https://issues.apache.org/jira/browse/ARROW-18310
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> There are a few places in the code (sink_node.cc, source_node.cc, 
> file_base.cc) where the backpressure counter is of type `int32_t`. This 
> prevents `ExecNode::Pause(...)` and `ExecNode::Resume(...)` from being 
> thread-safe. The proposal is to make these backpressure counters be of type 
> `std::atomic<int32_t>`.





[jira] [Created] (ARROW-18312) [C++] Optimize output sizes in segmented aggregation

2022-11-11 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18312:
---

 Summary: [C++] Optimize output sizes in segmented aggregation
 Key: ARROW-18312
 URL: https://issues.apache.org/jira/browse/ARROW-18312
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yaron Gvili


This is a [follow-up 
task|https://github.com/apache/arrow/pull/14352#discussion_r1019661909] for a 
currently pending PR.





[jira] [Created] (ARROW-18311) [C++] Add `Grouper::Reset`

2022-11-11 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18311:
---

 Summary: [C++] Add `Grouper::Reset`
 Key: ARROW-18311
 URL: https://issues.apache.org/jira/browse/ARROW-18311
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yaron Gvili


Adding `Grouper::Reset` will enable it to be reused in segmented streaming. 
See [this 
post|https://github.com/apache/arrow/pull/14352#discussion_r1016640969].





[jira] [Created] (ARROW-18310) [C++] Use atomic backpressure counter

2022-11-11 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18310:
---

 Summary: [C++] Use atomic backpressure counter
 Key: ARROW-18310
 URL: https://issues.apache.org/jira/browse/ARROW-18310
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yaron Gvili


There are a few places in the code (sink_node.cc, source_node.cc, file_base.cc) 
where the backpressure counter is of type `int32_t`. This prevents 
`ExecNode::Pause(...)` and `ExecNode::Resume(...)` from being thread-safe. The 
proposal is to make these backpressure counters be of type 
`std::atomic<int32_t>`.
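
As an illustration of the proposal, the sketch below keeps the counter in a `std::atomic<int32_t>` so that concurrent callers each obtain a distinct value. `BackpressureCounter` and `IsStale` are hypothetical names, and the idea that a receiver ignores signals carrying an older counter value is an assumption about how the counter is consumed, not a description of the Arrow code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the counter kept next to a sink or source node.
struct BackpressureCounter {
  std::atomic<int32_t> value{0};

  // Atomically increments the counter and returns the new value. With a
  // plain int32_t, two threads could read the same old value and hand out
  // the same ticket.
  int32_t Next() { return value.fetch_add(1) + 1; }
};

// A pause/resume signal is considered stale if a newer ticket has already
// been issued (assumed policy, for illustration only).
bool IsStale(const BackpressureCounter& counter, int32_t ticket) {
  assert(ticket >= 1);  // tickets start at 1
  return ticket < counter.value.load();
}
```

The read-modify-write happens in a single `fetch_add`, which is what makes handing out tickets safe from several threads without an extra mutex.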





[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions

2022-11-07 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629698#comment-17629698
 ] 

Yaron Gvili commented on ARROW-16211:
-

Which PR would this be?

> [C++][Python] Unregister compute functions
> --
>
> Key: ARROW-16211
> URL: https://issues.apache.org/jira/browse/ARROW-16211
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In general, when using UDFs, the user defines a function expecting a 
> particular outcome. When building the program, there needs to be a way to 
> update existing function kernels if the program expands beyond what was planned. 
> In such situations, there should be a way to remove the existing definition 
> and add a new definition. To enable this, the unregister functionality has to 
> be included. 





[jira] [Comment Edited] (ARROW-16811) [C++] Remove default exec context from Expression::Bind

2022-11-06 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552993#comment-17552993
 ] 

Yaron Gvili edited comment on ARROW-16811 at 11/7/22 6:49 AM:
--

You are referring to [this 
post|https://issues.apache.org/jira/browse/ARROW-16796?focusedCommentId=17552569=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17552569].
 This limited-bind is not ideal, though it can be useful as an intermediate 
solution in places in the code that cannot be easily changed to work with a 
non-default ExecContext. I imagine this could be the case in some user-facing 
APIs that currently do not take an ExecContext, and eventually default to the 
global function registry (perhaps examples exist in the dataset package?). In 
such cases, there are two options to consider: either break user code to force 
it to provide an ExecContext, or keep user code intact but fail at runtime when 
an expression gets bound in a non-safe way. The latter one is what I wanted to 
draw attention to.


was (Author: JIRAUSER284707):
You are referring to this post. This limited-bind is not ideal, though it can 
be useful as an intermediate solution in places in the code that cannot be 
easily changed to work with a non-default ExecContext. I imagine this could 
be the case in some user-facing APIs that currently do not take an ExecContext, 
and eventually default to the global function registry (perhaps examples exist 
in the dataset package?). In such cases, there are two options to consider: 
either break user code to force it to provide an ExecContext, or keep user code 
intact but fail at runtime when an expression gets bound in a non-safe way. The 
latter one is what I wanted to draw attention to.

> [C++] Remove default exec context from Expression::Bind
> ---
>
> Key: ARROW-16811
> URL: https://issues.apache.org/jira/browse/ARROW-16811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This came up in https://github.com/apache/arrow/pull/13355.
> It is maybe not very intuitive that Expression::Bind would require an 
> ExecContext and so we never provided one.  However, when binding expressions 
> we need to look up kernels, and that requires a function registry. Defaulting 
> to default_exec_context is something that should be done at a higher level 
> and so we should not allow ExecContext to be omitted when calling Bind.
> Furthermore, [~rtpsw] has suggested that we might want to split 
> Expression::Bind into two variants.  One which requires an ExecContext and 
> one which does not (but fails if it encounters a "call").
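
The two-variant split described above can be sketched on a toy expression tree. Everything here (`Expr`, `Registry`, `BindWithRegistry`, `BindWithoutRegistry`) is hypothetical and stands in for `Expression`, the function registry reachable through `ExecContext`, and the two proposed `Bind` variants; it is not the Arrow API.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy expression tree; not Arrow's arrow::compute::Expression.
struct Expr {
  enum class Kind { kLiteral, kFieldRef, kCall };
  Kind kind;
  std::string name;        // field or function name; unused for literals
  std::vector<Expr> args;  // call arguments; empty otherwise
};

// Toy function registry: just the set of known function names.
using Registry = std::set<std::string>;

// Variant 1: binding with an explicit registry; every call is looked up.
bool BindWithRegistry(const Expr& expr, const Registry& registry) {
  assert(expr.kind == Expr::Kind::kCall || expr.args.empty());
  if (expr.kind == Expr::Kind::kCall && registry.count(expr.name) == 0) {
    return false;  // unknown function
  }
  for (const Expr& arg : expr.args) {
    if (!BindWithRegistry(arg, registry)) return false;
  }
  return true;
}

// Variant 2: binding without a registry; fails as soon as it meets a call,
// because resolving a call would need a kernel lookup.
bool BindWithoutRegistry(const Expr& expr) {
  return expr.kind != Expr::Kind::kCall;
}
```

In this sketch a field reference or literal binds under either variant, while a call such as `add(x, 1)` binds only when a registry containing `add` is supplied.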





[jira] [Assigned] (ARROW-16811) [C++] Remove default exec context from Expression::Bind

2022-11-06 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili reassigned ARROW-16811:
---

Assignee: Yaron Gvili

> [C++] Remove default exec context from Expression::Bind
> ---
>
> Key: ARROW-16811
> URL: https://issues.apache.org/jira/browse/ARROW-16811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Yaron Gvili
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This came up in https://github.com/apache/arrow/pull/13355.
> It is maybe not very intuitive that Expression::Bind would require an 
> ExecContext and so we never provided one.  However, when binding expressions 
> we need to look up kernels, and that requires a function registry. Defaulting 
> to default_exec_context is something that should be done at a higher level 
> and so we should not allow ExecContext to be omitted when calling Bind.
> Furthermore, [~rtpsw] has suggested that we might want to split 
> Expression::Bind into two variants.  One which requires an ExecContext and 
> one which does not (but fails if it encounters a "call").





[jira] [Created] (ARROW-18135) [C++] Avoid warnings that ExecBatch::length may be uninitialized

2022-10-23 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18135:
---

 Summary: [C++] Avoid warnings that ExecBatch::length may be 
uninitialized
 Key: ARROW-18135
 URL: https://issues.apache.org/jira/browse/ARROW-18135
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili


Here is a build log of the master branch that shows the warnings:
{code:java}
[131/534] Building CXX object 
src/arrow/CMakeFiles/arrow_objlib.dir/compute/exec/exec_plan.cc.o
In file included from 
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:25:
In constructor 
‘arrow::compute::ExecBatch::ExecBatch(arrow::compute::ExecBatch&&)’,
    inlined from 
‘arrow::compute::DeclarationToExecBatchesAsync(arrow::compute::Declaration, 
arrow::compute::ExecContext*))>’
 at 
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:605:73,
    inlined from ‘_OIter std::transform(_IIter, _IIter, _OIter, 
_UnaryOperation) [with _IIter = 
std::move_iterator<__gnu_cxx::__normal_iterator*,
 std::vector, 
std::allocator > > > >; _OIter = 
std::back_insert_iterator >; 
_UnaryOperation = 
arrow::compute::DeclarationToExecBatchesAsync(arrow::compute::Declaration, 
arrow::compute::ExecContext*))>]’
 at /usr/include/c++/11/bits/stl_algo.h:4296:12,
    inlined from ‘std::vector arrow::internal::MapVector(Fn&&, 
std::vector<_ValT>&&) [with Fn = 
arrow::compute::DeclarationToExecBatchesAsync(arrow::compute::Declaration, 
arrow::compute::ExecContext*))>;
 From = std::optional; To = 
arrow::compute::ExecBatch]’ at 
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/util/vector.h:102:17,
    inlined from 
‘arrow::compute::DeclarationToExecBatchesAsync(arrow::compute::Declaration, 
arrow::compute::ExecContext*)::’ at 
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:604:44,
    inlined from ‘typename std::enable_if<(((! 
std::is_void::value) && (! 
arrow::detail::is_future::value)) && ((! NextFuture::is_empty) 
|| std::is_same::value))>::type 
arrow::detail::ContinueFuture::operator()(NextFuture, ContinueFunc&&, Args&& 
...) const [with ContinueFunc = 
arrow::compute::DeclarationToExecBatchesAsync(arrow::compute::Declaration, 
arrow::compute::ExecContext*)::; Args = {}; ContinueResult = 
arrow::Result >; NextFuture = 
arrow::Future >]’ at 
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/util/future.h:150:22,
    inlined from ‘void 
arrow::detail::ContinueFuture::IgnoringArgsIf(std::true_type, NextFuture&&, 
ContinueFunc&&, Args&& ...) const [with ContinueFunc = 
arrow::compute::DeclarationToExecBatchesAsync(arrow::compute::Declaration, 
arrow::compute::ExecContext*)::; NextFuture = 
arrow::Future >; Args = {const 
arrow::internal::Empty&}]’ at 
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/util/future.h:188:15,
    inlined from ‘void arrow::Future::ThenOnComplete::operator()(const arrow::Result&) && [with OnSuccess = 
arrow::compute::DeclarationToExecBatchesAsync(arrow::compute::Declaration, 
arrow::compute::ExecContext*)::; OnFailure = 
arrow::Future<>::PassthruOnFailure >; T = arrow::internal::Empty]’ at 
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/util/future.h:545:39:
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/compute/exec.h:179:21: 
warning: ‘*(arrow::compute::ExecBatch*)((char*)& + 
offsetof(std::optional,std::optional::.std::_Optional_base::)).arrow::compute::ExecBatch::length’ may be used 
uninitialized [-Wmaybe-uninitialized]
  179 | struct ARROW_EXPORT ExecBatch {
      |                     ^
In file included from /usr/include/c++/11/functional:65,
                 from 
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/compute/exec/exec_plan.h:22,
                 from 
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:18:
/usr/include/c++/11/bits/stl_algo.h: In member function ‘void 
arrow::Future::ThenOnComplete::operator()(const 
arrow::Result&) && [with OnSuccess = 
arrow::compute::DeclarationToExecBatchesAsync(arrow::compute::Declaration, 
arrow::compute::ExecContext*)::; OnFailure = 
arrow::Future<>::PassthruOnFailure >; T = arrow::internal::Empty]’:
/usr/include/c++/11/bits/stl_algo.h:4296:31: note: ‘’ declared here
 4296 |         *__result = __unary_op(*__first);
      |                     ~~^~
[247/534] Building CXX object 
src/arrow/CMakeFiles/arrow_testing_objlib.dir/compute/exec/test_util.cc.o
In file included from 
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/compute/exec/test_util.h:29,
                 from 
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/compute/exec/test_util.cc:18:
In constructor 
‘arrow::compute::ExecBatch::ExecBatch(arrow::compute::ExecBatch&&)’,
    inlined from ‘arrow::compute::StartAndCollect(arrow::compute::ExecPlan*, 
arrow::AsyncGenerator 

[jira] [Resolved] (ARROW-17964) [C++] Range data comparison for struct type may go out of bounds

2022-10-13 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili resolved ARROW-17964.
-
Resolution: Won't Fix

See PR for rationale.

> [C++] Range data comparison for struct type may go out of bounds
> 
>
> Key: ARROW-17964
> URL: https://issues.apache.org/jira/browse/ARROW-17964
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> When the struct-typed items being compared do not have the same length as the 
> struct, an index-access to child data may go out of bounds.





[jira] [Created] (ARROW-18029) [Format] archery lint for cmake should show error details

2022-10-13 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18029:
---

 Summary: [Format] archery lint for cmake should show error details
 Key: ARROW-18029
 URL: https://issues.apache.org/jira/browse/ARROW-18029
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Yaron Gvili


Here is example output from a failed invocation of `archery lint 
--cmake-format`:
 
INFO:archery:Running cmake-format linters
ERROR __main__.py:618: Check failed: 
/arrow/cpp/cmake_modules/ThirdpartyToolchain.cmake
 
It would be helpful to get the error details on failure, e.g., as a diff output 
like for C++. Granted, this may be low priority since `archery lint 
--cmake-format --fix` fixes the errors.





[jira] [Updated] (ARROW-18015) [C++] Add validation to ExecBatch

2022-10-12 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-18015:

Description: It's easy to get an invalid `ExecBatch`, if only because its 
fields are public, but there is currently no way to validate it. See 
[discussion 
here|https://github.com/apache/arrow/pull/14386#discussion_r993669256].

> [C++] Add validation to ExecBatch
> -
>
> Key: ARROW-18015
> URL: https://issues.apache.org/jira/browse/ARROW-18015
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Priority: Major
>
> It's easy to get an invalid `ExecBatch`, if only because its fields are 
> public, but there is currently no way to validate it. See [discussion 
> here|https://github.com/apache/arrow/pull/14386#discussion_r993669256].
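
A rough idea of what such validation might check, written against a toy type rather than the real `ExecBatch`: `ToyBatch`, `Validate` and `CheckValid` are hypothetical, and the rule that every non-scalar column must match the batch length is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in for ExecBatch with public fields. Each column is either a
// scalar (marked by length -1) or an array with its own length.
struct ToyBatch {
  std::vector<int64_t> column_lengths;
  int64_t length;
};

// Returns an empty string if the batch is consistent, else an error message.
std::string Validate(const ToyBatch& batch) {
  if (batch.length < 0) return "negative batch length";
  for (std::size_t i = 0; i < batch.column_lengths.size(); ++i) {
    const int64_t n = batch.column_lengths[i];
    if (n != -1 && n != batch.length) {
      return "column " + std::to_string(i) + " has length " +
             std::to_string(n) + ", expected " + std::to_string(batch.length);
    }
  }
  return "";
}

// Debug-only convenience: aborts in debug builds if the batch is invalid.
inline void CheckValid(const ToyBatch& batch) {
  assert(Validate(batch).empty());
}
```

Returning a message instead of asserting directly lets callers decide whether an invalid batch is a programming error or a recoverable condition.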





[jira] [Created] (ARROW-18015) [C++] Add validation to ExecBatch

2022-10-12 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18015:
---

 Summary: [C++] Add validation to ExecBatch
 Key: ARROW-18015
 URL: https://issues.apache.org/jira/browse/ARROW-18015
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yaron Gvili








[jira] [Updated] (ARROW-18004) [C++] ExecBatch conversion to RecordBatch may go out of bounds

2022-10-12 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-18004:

Description: When the schema given to `ExecBatch::ToRecordBatch` is wider 
than the `ExecBatch` values vector, an out-of-bounds index is accessed.

> [C++] ExecBatch conversion to RecordBatch may go out of bounds
> --
>
> Key: ARROW-18004
> URL: https://issues.apache.org/jira/browse/ARROW-18004
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> When the schema given to `ExecBatch::ToRecordBatch` is wider than the 
> `ExecBatch` values vector, an out-of-bounds index is accessed.
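
The failure mode and one possible guard can be sketched with plain vectors. `ToRecord`, `Column` and `NamedColumn` are hypothetical and only mimic the shape of the conversion; whether the real fix returns an error in this case is not stated in this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Column = std::vector<int>;
using NamedColumn = std::pair<std::string, Column>;

// Hypothetical conversion, not Arrow's ExecBatch::ToRecordBatch. Pairs each
// schema field name with the batch value at the same index. Returns false,
// leaving `out` untouched, if the schema is wider than the values vector.
bool ToRecord(const std::vector<std::string>& field_names,
              const std::vector<Column>& values,
              std::vector<NamedColumn>* out) {
  assert(out != nullptr);
  // Without this check, values[i] below would read past the end.
  if (field_names.size() > values.size()) return false;
  std::vector<NamedColumn> result;
  for (std::size_t i = 0; i < field_names.size(); ++i) {
    result.emplace_back(field_names[i], values[i]);
  }
  *out = std::move(result);
  return true;
}
```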





[jira] [Created] (ARROW-18004) [C++] ExecBatch conversion to RecordBatch may go out of bounds

2022-10-12 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18004:
---

 Summary: [C++] ExecBatch conversion to RecordBatch may go out of 
bounds
 Key: ARROW-18004
 URL: https://issues.apache.org/jira/browse/ARROW-18004
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17980) [C++] As-of-Join Substrait extension

2022-10-12 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-17980:

Description: This issue will add Arrow support for handling As-Of-Join as a 
Substrait extension based on [Weston's Substrait extension 
prototype|https://github.com/apache/arrow/compare/master...westonpace:arrow:experiment/substrait-extension].

> [C++] As-of-Join Substrait extension
> 
>
> Key: ARROW-17980
> URL: https://issues.apache.org/jira/browse/ARROW-17980
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> This issue will add Arrow support for handling As-Of-Join as a Substrait 
> extension based on [Weston's Substrait extension 
> prototype|https://github.com/apache/arrow/compare/master...westonpace:arrow:experiment/substrait-extension].





[jira] [Created] (ARROW-17980) [C++] As-of-Join Substrait extension

2022-10-10 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-17980:
---

 Summary: [C++] As-of-Join Substrait extension
 Key: ARROW-17980
 URL: https://issues.apache.org/jira/browse/ARROW-17980
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili








[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions

2022-10-08 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17614394#comment-17614394
 ] 

Yaron Gvili commented on ARROW-16211:
-

> Only concern is about the usability. Not sure if that can be handled in this 
> Jira. 

Makes sense to tackle usability of the nested registry approach in a separate 
jira, perhaps in the context of one of the use cases Weston listed.

> I like the idea about an API to make use the nested registries as a stack and 
> make a caller which scans through the registries and finding the function 
> being called. Should we create a separate ticket for this? 

The caller would only need to pick up and query the nested registry instance at 
the top of the stack. This is because each nested registry instance except the 
bottom one is linked to a registry instance one step lower on the stack via a 
parent link and will refer to its parent automatically.

Yes, I think it is clearer to handle the implementation of a 
nested-registry-stack in one issue and any use of it (like in `CallFunction` or 
`pc.call_function`) in another issue.

> [C++][Python] Unregister compute functions
> --
>
> Key: ARROW-16211
> URL: https://issues.apache.org/jira/browse/ARROW-16211
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In general, when using UDFs, the user defines a function expecting a 
> particular outcome. When building the program, there needs to be a way to 
> update existing function kernels if it expands beyond what is planned before. 
> In such situations, there should be a way to remove the existing definition 
> and add a new definition. To enable this, the unregister functionality has to 
> be included. 





[jira] [Commented] (ARROW-17965) [C++] ExecBatch support for ChunkedArray values

2022-10-07 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17614229#comment-17614229
 ] 

Yaron Gvili commented on ARROW-17965:
-

Not sure how much context you are asking for. I bumped into this when working 
on ordered aggregation. The test cases I worked out for this involved chunked 
arrays (but not an `ExecPlan`) that triggered the described failure that led to 
this issue.

I suppose it is possible to split to multiple `ExecBatch` instances, but I 
think this is not convenient for the user and potentially less efficient, e.g., 
in the context of streaming it is more efficient to consume a large `ExecBatch` 
with a chunked array than to consume multiple smaller `ExecBatch` instances 
with the same data.

I believe the proposed code is simple enough to warrant a review. Let me know 
your thoughts.

> [C++] ExecBatch support for ChunkedArray values
> ---
>
> Key: ARROW-17965
> URL: https://issues.apache.org/jira/browse/ARROW-17965
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> Currently, `ExecBatch` does not handle chunked arrays when printing or 
> slicing. The code assumes that if a value is not a scalar then it is an 
> array, and so will fail on chunked array values.





[jira] [Updated] (ARROW-17965) [C++] ExecBatch support for ChunkedArray values

2022-10-07 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-17965:

Description: Currently, `ExecBatch` does not handle chunked arrays when 
printing or slicing. The code assumes that if a value is not a scalar then it 
is an array, and so will fail on chunked array values.
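
The assumption described here can be illustrated with a small sketch that
handles all three kinds of value explicitly. The `Toy*` aliases and `Length`
are hypothetical stand-ins built on `std::variant`; they are not Arrow's
`Datum` or `ExecBatch` API.

```cpp
#include <cassert>
#include <cstdint>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the three kinds of value; not Arrow's Datum.
using ToyScalar = int;
using ToyArray = std::vector<int>;
using ToyChunkedArray = std::vector<ToyArray>;
using ToyValue = std::variant<ToyScalar, ToyArray, ToyChunkedArray>;

// Length of a value, or -1 for a scalar (which has no length of its own).
// Each alternative is handled explicitly instead of assuming that
// "not a scalar" implies "array".
std::int64_t Length(const ToyValue& value) {
  if (std::holds_alternative<ToyScalar>(value)) return -1;
  if (const auto* array = std::get_if<ToyArray>(&value)) {
    return static_cast<std::int64_t>(array->size());
  }
  // Remaining alternative: a chunked array; sum the chunk lengths.
  std::int64_t total = 0;
  for (const ToyArray& chunk : std::get<ToyChunkedArray>(value)) {
    total += static_cast<std::int64_t>(chunk.size());
  }
  return total;
}
```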

> [C++] ExecBatch support for ChunkedArray values
> ---
>
> Key: ARROW-17965
> URL: https://issues.apache.org/jira/browse/ARROW-17965
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> Currently, `ExecBatch` does not handle chunked arrays when printing or 
> slicing. The code assumes that if a value is not a scalar then it is an 
> array, and so will fail on chunked array values.





[jira] [Created] (ARROW-17965) [C++] ExecBatch support for ChunkedArray values

2022-10-07 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-17965:
---

 Summary: [C++] ExecBatch support for ChunkedArray values
 Key: ARROW-17965
 URL: https://issues.apache.org/jira/browse/ARROW-17965
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili








[jira] [Updated] (ARROW-17964) [C++] Range data comparison for struct type may go out of bounds

2022-10-07 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-17964:

Description: When the struct-typed items being compared do not have the 
same length as the struct, an index-access to child data may go out of bounds.

> [C++] Range data comparison for struct type may go out of bounds
> 
>
> Key: ARROW-17964
> URL: https://issues.apache.org/jira/browse/ARROW-17964
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> When the struct-typed items being compared do not have the same length as the 
> struct, an index-access to child data may go out of bounds.





[jira] [Assigned] (ARROW-17964) [C++] Range data comparison for struct type may go out of bounds

2022-10-07 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili reassigned ARROW-17964:
---

Assignee: Yaron Gvili

> [C++] Range data comparison for struct type may go out of bounds
> 
>
> Key: ARROW-17964
> URL: https://issues.apache.org/jira/browse/ARROW-17964
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>






[jira] [Created] (ARROW-17964) [C++] Range data comparison for struct type may go out of bounds

2022-10-07 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-17964:
---

 Summary: [C++] Range data comparison for struct type may go out of 
bounds
 Key: ARROW-17964
 URL: https://issues.apache.org/jira/browse/ARROW-17964
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Yaron Gvili








[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions

2022-10-07 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17614131#comment-17614131
 ] 

Yaron Gvili commented on ARROW-16211:
-

I'm not against a specific and simple solution for a simple use case, and 
you're welcome to pursue it. In this discussion, my main aim was to explain how 
all use cases discussed here are supported in a straightforward way using 
nested registries and without the need to modify a registry instance while in 
use.

> In the regular case it is just always `pc.call_function(..)`, in this case we 
> have to always make sure we do `registry_x.call_function`, isn't it?

With this question, the discussion shifts from whether registry function 
removal is necessary (I argued it isn't) to how best to design a user API for 
calling registry functions in the context of at least this use case.

I argue we can design a user API that encapsulates the active registry, so that 
the function caller need not remember it, as follows. The execution context 
could manage a stack of nested registries, so that a call-function invocation 
would automatically look up the registry at the top of the stack. When a piece 
of code wants to set up a nested registry for a second piece of code it intends 
to invoke, it does so by adding the nested registry to this stack, invoking the 
second piece of code, and popping the stack. This context stack management 
ensures the correct registry instance is always in scope.
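
A minimal sketch of such a context-managed stack, assuming toy types:
`ToyRegistry`, `ToyContext`, and `ToyContext::Scope` are hypothetical names and
do not correspond to Arrow's `ExecContext` or `FunctionRegistry`. The guard
pushes a nested registry on construction and pops it on scope exit.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Fn = int (*)(int);

// Hypothetical toy registry; not Arrow's FunctionRegistry.
struct ToyRegistry {
  const ToyRegistry* parent = nullptr;
  std::map<std::string, Fn> fns;

  // Looks here first, then follows the parent link.
  Fn Find(const std::string& name) const {
    auto it = fns.find(name);
    if (it != fns.end()) return it->second;
    return parent ? parent->Find(name) : nullptr;
  }
};

// Hypothetical toy context that owns the stack of active registries.
class ToyContext {
 public:
  explicit ToyContext(const ToyRegistry* root) { stack_.push_back(root); }

  // Callers never name a registry: the top of the stack is used.
  int Call(const std::string& name, int arg) const {
    Fn fn = stack_.back()->Find(name);
    assert(fn != nullptr && "unknown function");
    return fn(arg);
  }

  // RAII guard: pushes a nested registry, pops it on scope exit.
  class Scope {
   public:
    Scope(ToyContext& ctx, const ToyRegistry* nested) : ctx_(ctx) {
      ctx_.stack_.push_back(nested);
    }
    ~Scope() { ctx_.stack_.pop_back(); }
    Scope(const Scope&) = delete;
    Scope& operator=(const Scope&) = delete;

   private:
    ToyContext& ctx_;
  };

 private:
  std::vector<const ToyRegistry*> stack_;
};
```

With this shape, code that wants a nested scope declares a `Scope` object
around the call it makes, and the previous registry is back in effect as soon
as the guard goes out of scope.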

Of course, the fact that we can doesn't mean that we must. My aim with this
point is to show that there is a well-designed alternative to registry function
removal.

> While with an approach of the ability to just drop what you don't need is way 
> easier.

IMHO, it's a bit easier (e.g., removing a function from an existing registry 
instance vs creating a nested registry instance and removing from it) but less 
safe (potential side-effects and race conditions). A design tension between 
usability and safety is common, and calls for prioritization. My vote is to 
prioritize safety.

> May be we should also allow the ability to unregister/override functions. 
> That would provide flexibility for the users to use the UDFs for the said 
> scenarios.

If I'm forced to accept this way of registry editing, I'd say that then the 
docs would need to be very clear about the safety issues this practice raises 
and to describe a safer alternative as discussed here. I think if the safer 
alternative is not implemented via an easy API (like the one I described) then 
users will surely practice the less-safe alternative. This is why I view 
adding these docs as a bit better but still insufficient for safety.

> [C++][Python] Unregister compute functions
> --
>
> Key: ARROW-16211
> URL: https://issues.apache.org/jira/browse/ARROW-16211
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In general, when using UDFs, the user defines a function expecting a 
> particular outcome. When building the program, there needs to be a way to 
> update existing function kernels if it expands beyond what is planned before. 
> In such situations, there should be a way to remove the existing definition 
> and add a new definition. To enable this, the unregister functionality has to 
> be included. 





[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions

2022-10-06 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17613332#comment-17613332
 ] 

Yaron Gvili commented on ARROW-16211:
-

Weston is correct that the main use case I designed nested registries for is 
embedded UDFs, or at least UDFs that are used in the context of a particular 
scope, such as a single plan's execution. However, see also [the case I noted 
here|https://issues.apache.org/jira/browse/ARROW-16211?focusedCommentId=17539044=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17539044].

> But that seems not very practical that you need to remember for each function 
> in which registry it lives to ensure you pass the correct one when calling it.

I believe you are referring to a burden on the user, but I don't think the user 
would actually need to do this. A normal case is when a first piece of code 
gets passed in a registry and wants to pass along a registry to a second piece 
of code. So, the first piece of code has two basic things it can do:
 # Pass the registry unmodified; this keeps working in the same scope.
 # Pass a nested registry wrapping it with some modifications; this creates a 
nested scope.

In these cases, the registry being used at each piece of code is managed on the 
stack, so when the second piece of code returns, the first piece of code 
continues with its original registry, and the user need not manage registries 
manually. It is straightforward to extend this to other cases, like passing 
from a main thread to working threads.

> > There is another case, where some set of UDFs are predefined and then 
> > referenced (e.g. by name) in incoming plans. In that scenario I think a 
> > nested registry is considerably less useful and the ability to unregister 
> > or override would be helpful.
> Yes, and that is the use case that I was talking about (since that is what 
> the pyarrow register_scalar_function enabled you to do)

Even in such a case where one must remove functions, I'd recommend creating a 
new registry instance with the desired functions removed (perhaps by 
initializing a builder from an existing registry instance) rather than editing 
an existing registry instance that may be in use elsewhere, in order to make it 
much harder to (inadvertently) create side effects or race conditions. Still, 
note that it is very easy for a nested registry to effectively support function 
removal, basically by registering a null function on a name, that overrides the 
same-named function of the parent registry; as usual, the nested registry can 
stay fixed once set up.
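
The "null function" idea can be sketched as follows, assuming a toy registry in
which a name mapped to `nullptr` hides the parent's entry. `ToyRegistry` and
`Hide` are hypothetical names; Arrow's registry may not expose anything like
this.

```cpp
#include <cassert>
#include <map>
#include <string>

using Fn = int (*)(int);

// Hypothetical toy registry; not Arrow's FunctionRegistry.
struct ToyRegistry {
  const ToyRegistry* parent = nullptr;
  // A name mapped to nullptr is a "null function": it hides the
  // same-named entry of the parent without modifying the parent.
  std::map<std::string, Fn> fns;

  void Hide(const std::string& name) { fns[name] = nullptr; }

  // Returns nullptr both for unknown names and for hidden ones.
  Fn Find(const std::string& name) const {
    auto it = fns.find(name);
    if (it != fns.end()) return it->second;  // may be nullptr (hidden)
    return parent ? parent->Find(name) : nullptr;
  }
};
```

The parent registry is never written to, so other users of it are unaffected
by the removal.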

> [C++][Python] Unregister compute functions
> --
>
> Key: ARROW-16211
> URL: https://issues.apache.org/jira/browse/ARROW-16211
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In general, when using UDFs, the user defines a function expecting a 
> particular outcome. When building the program, there needs to be a way to 
> update existing function kernels if it expands beyond what is planned before. 
> In such situations, there should be a way to remove the existing definition 
> and add a new definition. To enable this, the unregister functionality has to 
> be included. 





[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions

2022-09-29 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610986#comment-17610986
 ] 

Yaron Gvili commented on ARROW-16211:
-

Side note: I have developed PyArrow wrappers for the nested registries as part 
of a prototype for UDFs that I should be able to contribute when the time comes.

I agree documentation of nested registries can be improved, both inside the 
code and in doc pages. We do have a 
[couple|https://github.com/apache/arrow/blob/902781d1f3a41563a23d6755433a8e40ce82de7b/cpp/src/arrow/compute/registry_test.cc#L119]
 of [test 
cases|https://github.com/apache/arrow/blob/902781d1f3a41563a23d6755433a8e40ce82de7b/cpp/src/arrow/compute/registry_test.cc#L213].
 Please create jiras for what you think is missing.

> [C++][Python] Unregister compute functions
> --
>
> Key: ARROW-16211
> URL: https://issues.apache.org/jira/browse/ARROW-16211
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In general, when using UDFs, the user defines a function expecting a 
> particular outcome. When building the program, there needs to be a way to 
> update existing function kernels if it expands beyond what is planned before. 
> In such situations, there should be a way to remove the existing definition 
> and add a new definition. To enable this, the unregister functionality has to 
> be included. 





[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions

2022-09-29 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610963#comment-17610963
 ] 

Yaron Gvili commented on ARROW-16211:
-

Consider a case where X, Y, and Z are disjoint sets of functions. One can 
register (the functions of) X in registry 1, then register Y in registry 2 
whose parent is registry 1, and finally register Z in registry 3 whose parent 
is registry 2. With respect to the case described by Vibhatha, I believe the 
default registry corresponds to registry 1, the first subset of functions 
corresponds to Y, and the second to Z. This works nicely when X, Y, and Z are 
known upfront; otherwise, one may need to register again. For example, suppose 
Y1 and Y2 are disjoint sets of functions whose union is Y, and one wants to 
register functions on top of those of X and Y1, then one would need to create a 
fresh registry whose parent is registry 1, and register Y1 again on this fresh 
registry. I think this is a reasonable result when not knowing upfront. In 
general, considering registries have extended scopes, I think it is better to 
create a fresh nested registry and keep it fixed while in use than to edit an 
existing one using removals while in use.
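
The X/Y/Z layering can be sketched with a toy registry that holds a parent
link. `ToyRegistry`, `Add`, and `Find` are hypothetical names for illustration
and are not Arrow's `FunctionRegistry` API.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical toy registry; not Arrow's FunctionRegistry.
class ToyRegistry {
 public:
  using Fn = std::function<int(int)>;

  explicit ToyRegistry(std::shared_ptr<const ToyRegistry> parent = nullptr)
      : parent_(std::move(parent)) {}

  void Add(const std::string& name, Fn fn) { fns_[name] = std::move(fn); }

  // Looks in this registry first, then follows the parent link.
  // Returns nullptr when no registry in the chain knows the name.
  const Fn* Find(const std::string& name) const {
    auto it = fns_.find(name);
    if (it != fns_.end()) return &it->second;
    return parent_ ? parent_->Find(name) : nullptr;
  }

 private:
  std::shared_ptr<const ToyRegistry> parent_;
  std::map<std::string, Fn> fns_;
};
```

In these terms, registry 1 is constructed without a parent and holds X,
registry 2 is constructed with registry 1 as its parent and holds Y, and
registry 3 is constructed with registry 2 as its parent and holds Z. The
"X and Y1 only" case is a fresh `ToyRegistry` whose parent is registry 1, with
Y1 added to it again; nothing is removed from registry 2.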

> [C++][Python] Unregister compute functions
> --
>
> Key: ARROW-16211
> URL: https://issues.apache.org/jira/browse/ARROW-16211
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In general, when using UDFs, the user defines a function expecting a 
> particular outcome. When building the program, there needs to be a way to 
> update existing function kernels if it expands beyond what is planned before. 
> In such situations, there should be a way to remove the existing definition 
> and add a new definition. To enable this, the unregister functionality has to 
> be included. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions

2022-09-22 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17608132#comment-17608132
 ] 

Yaron Gvili commented on ARROW-16211:
-

I think many PyArrow function execution code paths lead to Arrow C++ APIs with 
a configurable function registry. For example, the [`CallFunction` 
API|https://github.com/apache/arrow/blob/43e66a928e29a811b0e14f2a2d7ffa9f8290ccbe/cpp/src/arrow/compute/exec.h]
 accepts a function registry via its `ExecContext` argument. So, one could 
create a first nested function registry holding a first implementation of a 
UDF, use this first registry for several function execution invocations, then 
create a second nested function registry holding a second implementation of the 
UDF, and switch over to using this second registry. The Python interpreter 
remains the same, and there's no need to remove the UDF - the first registry 
can just be dropped.

> [C++][Python] Unregister compute functions
> --
>
> Key: ARROW-16211
> URL: https://issues.apache.org/jira/browse/ARROW-16211
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In general, when using UDFs, the user defines a function expecting a 
> particular outcome. When building the program, there needs to be a way to 
> update existing function kernels if it expands beyond what is planned before. 
> In such situations, there should be a way to remove the existing definition 
> and add a new definition. To enable this, the unregister functionality has to 
> be included. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17696) [C++] arrow-compute-asof-join-node-test inordinately slow

2022-09-21 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili reassigned ARROW-17696:
---

Assignee: Yaron Gvili

> [C++] arrow-compute-asof-join-node-test inordinately slow
> -
>
> Key: ARROW-17696
> URL: https://issues.apache.org/jira/browse/ARROW-17696
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It seems that {{arrow-compute-asof-join-node-test}} is designed such as each 
> sub-test takes 2 seconds.  The entire test file takes 120 seconds here.
> This is much too slow for a single test file and should drastically be 
> reduced.





[jira] [Commented] (ARROW-17696) [C++] arrow-compute-asof-join-node-test inordinately slow

2022-09-21 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607718#comment-17607718
 ] 

Yaron Gvili commented on ARROW-17696:
-

Note that I'll set the timeout at the test-case level, not for the entire test 
executable. I'll cut it down by a factor of 10.

> [C++] arrow-compute-asof-join-node-test inordinately slow
> -
>
> Key: ARROW-17696
> URL: https://issues.apache.org/jira/browse/ARROW-17696
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> It seems that {{arrow-compute-asof-join-node-test}} is designed such as each 
> sub-test takes 2 seconds.  The entire test file takes 120 seconds here.
> This is much too slow for a single test file and should drastically be 
> reduced.





[jira] [Commented] (ARROW-17696) [C++] arrow-compute-asof-join-node-test inordinately slow

2022-09-21 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607689#comment-17607689
 ] 

Yaron Gvili commented on ARROW-17696:
-

IIRC, at the time, I asked about the timeout for this test, and the answer was 
something like 300 seconds. How much do we want it to be? I can set it rather 
flexibly.

> [C++] arrow-compute-asof-join-node-test inordinately slow
> -
>
> Key: ARROW-17696
> URL: https://issues.apache.org/jira/browse/ARROW-17696
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> It seems that {{arrow-compute-asof-join-node-test}} is designed such as each 
> sub-test takes 2 seconds.  The entire test file takes 120 seconds here.
> This is much too slow for a single test file and should drastically be 
> reduced.





[jira] [Comment Edited] (ARROW-16211) [C++][Python] Unregister compute functions

2022-09-17 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606126#comment-17606126
 ] 

Yaron Gvili edited comment on ARROW-16211 at 9/17/22 2:21 PM:
--

I believe this issue is superseded by ARROW-16677 which allows creating a 
nested registry that can later be dropped lock-free along with all functions 
registered on it.


was (Author: JIRAUSER284707):
I believe this issue is superseded by 
https://issues.apache.org/jira/browse/ARROW-16677 which allows creating a 
nested registry that can later be dropped lock-free along with all functions 
registered on it.

> [C++][Python] Unregister compute functions
> --
>
> Key: ARROW-16211
> URL: https://issues.apache.org/jira/browse/ARROW-16211
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In general, when using UDFs, the user defines a function expecting a 
> particular outcome. When building the program, there needs to be a way to 
> update existing function kernels if it expands beyond what is planned before. 
> In such situations, there should be a way to remove the existing definition 
> and add a new definition. To enable this, the unregister functionality has to 
> be included. 





[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions

2022-09-17 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606126#comment-17606126
 ] 

Yaron Gvili commented on ARROW-16211:
-

I believe this issue is superseded by 
https://issues.apache.org/jira/browse/ARROW-16677 which allows creating a 
nested registry that can later be dropped lock-free along with all functions 
registered on it.

> [C++][Python] Unregister compute functions
> --
>
> Key: ARROW-16211
> URL: https://issues.apache.org/jira/browse/ARROW-16211
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In general, when using UDFs, the user defines a function expecting a 
> particular outcome. When building the program, there needs to be a way to 
> update existing function kernels if it expands beyond what is planned before. 
> In such situations, there should be a way to remove the existing definition 
> and add a new definition. To enable this, the unregister functionality has to 
> be included. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17676) [C++] [Python] User-defined tabular functions

2022-09-11 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-17676:
---

 Summary: [C++] [Python] User-defined tabular functions
 Key: ARROW-17676
 URL: https://issues.apache.org/jira/browse/ARROW-17676
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Reporter: Yaron Gvili
Assignee: Yaron Gvili


Currently, only a stateless user-defined function is supported in PyArrow. This 
issue will add support for a user-defined tabular function, which is a 
user-function implemented in Python that returns a stateful stream of tabular 
data.





[jira] [Created] (ARROW-17653) [C++] AsofJoinNode 128-bit hashing

2022-09-08 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-17653:
---

 Summary: [C++] AsofJoinNode 128-bit hashing
 Key: ARROW-17653
 URL: https://issues.apache.org/jira/browse/ARROW-17653
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yaron Gvili


A recent version of `AsofJoinNode` uses 64-bit hashing for by-key values. This 
[leads to a non-negligible probability of 
collisions|https://github.com/apache/arrow/pull/13880] that are not arbitrated. 
Using 128-bit hashing, the probability will become negligible.
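
To get a feel for the difference, the collision probability can be estimated
with the standard birthday bound. This is a sketch under the assumption of an
ideal, uniformly distributed hash, which a real hash function only
approximates; `CollisionProbability` is a hypothetical helper, not part of
Arrow.

```cpp
#include <cassert>
#include <cmath>

// Birthday-bound estimate of the probability that at least two of `n`
// distinct keys collide under an ideal uniform hash with `bits` output
// bits: p ~= 1 - exp(-n * (n - 1) / 2^(bits + 1)).
double CollisionProbability(double n, int bits) {
  const double pairs = n * (n - 1.0) / 2.0;  // number of key pairs
  // expm1 keeps precision when the exponent is tiny (the 128-bit case).
  return -std::expm1(-pairs / std::ldexp(1.0, bits));
}
```

Under these assumptions, a billion distinct by-key values give roughly a 2.7%
chance of at least one collision with 64 bits, and a chance on the order of
1e-21 with 128 bits. The key count is an illustrative choice, not a figure
from the linked discussion.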





[jira] [Created] (ARROW-17642) [C++] Add ordered aggregation

2022-09-07 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-17642:
---

 Summary: [C++] Add ordered aggregation
 Key: ARROW-17642
 URL: https://issues.apache.org/jira/browse/ARROW-17642
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili


Ordered aggregation is similar to grouped aggregation except that one column in 
the grouping key is (known to be) ordered. The result of both types of 
aggregations is the same but the existence of an ordered column enables 
optimizing.
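
One optimization an ordered column allows can be sketched as follows. The
sketch simplifies to a single, sorted key column and a sum aggregate;
`OrderedSum` is a hypothetical function and does not reflect how the Arrow
node is implemented.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Sums `value` per `key` for rows (key, value) already sorted by key.
// Because equal keys are adjacent, only the last group in `out` is still
// open; every earlier group is final once a new key is seen, so no hash
// table over all groups is needed.
std::vector<std::pair<int, long long>> OrderedSum(
    const std::vector<std::pair<int, int>>& rows) {
  std::vector<std::pair<int, long long>> out;
  for (const auto& row : rows) {
    if (out.empty() || out.back().first != row.first) {
      out.emplace_back(row.first, 0LL);  // key changed: start a new group
    }
    out.back().second += row.second;
  }
  return out;
}
```

In a streaming setting the closed groups could be emitted downstream right
away, which keeps the working state bounded by a single group.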





[jira] [Created] (ARROW-17613) [C++] Add function execution API for a preconfigured kernel

2022-09-05 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-17613:
---

 Summary: [C++] Add function execution API for a preconfigured 
kernel
 Key: ARROW-17613
 URL: https://issues.apache.org/jira/browse/ARROW-17613
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili


Currently, the function execution API goes through kernel selection on each 
invocation. This issue will add a faster-path for executing a preconfigured 
kernel.
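
The difference between the two paths can be sketched with toy kernels. All
names here (`SelectKernel`, `Call`, `PreconfiguredCall`) are hypothetical and
only illustrate the idea of resolving the kernel once; they are not the API
this issue adds.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

using Kernel = long long (*)(long long, long long);

long long AddKernel(long long a, long long b) { return a + b; }
long long MulKernel(long long a, long long b) { return a * b; }

// Stands in for kernel selection: resolves a name to a concrete kernel.
Kernel SelectKernel(const std::string& name) {
  if (name == "add") return &AddKernel;
  if (name == "mul") return &MulKernel;
  throw std::invalid_argument("unknown function: " + name);
}

// Regular path: selection happens on every call.
long long Call(const std::string& name, long long a, long long b) {
  return SelectKernel(name)(a, b);
}

// Faster path: selection happens once, at construction.
class PreconfiguredCall {
 public:
  explicit PreconfiguredCall(const std::string& name)
      : kernel_(SelectKernel(name)) {}

  long long operator()(long long a, long long b) const {
    return kernel_(a, b);
  }

 private:
  Kernel kernel_;
};
```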





[jira] [Created] (ARROW-17610) [C++] Support additional source types in SourceNode

2022-09-04 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-17610:
---

 Summary: [C++] Support additional source types in SourceNode
 Key: ARROW-17610
 URL: https://issues.apache.org/jira/browse/ARROW-17610
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili


This issue will add support for `ArrayVector`, `ExecBatch`, and `RecordBatch` 
sources in `SourceNode`. See [this 
thread|https://lists.apache.org/thread/9l23c0w48ywx314klbyshz8ntyzgs1zw] for 
context.





[jira] [Created] (ARROW-17492) [C++] Hashing32/64 support for large var-binary types

2022-08-22 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-17492:
---

 Summary: [C++] Hashing32/64 support for large var-binary types
 Key: ARROW-17492
 URL: https://issues.apache.org/jira/browse/ARROW-17492
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili


Currently, Hashing32/64 only supports non-large var-binary types. This issue 
will add support for large var-binary types.





[jira] [Updated] (ARROW-17289) [C++] Add type category membership checks

2022-08-18 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-17289:

Description: 
-Currently, type categories are only available as vectors, e.g., the category 
of integer types is available via `arrow::IntTypes()`. This issue will add type 
category membership test, e.g. `arrow::IsIntType(type)`.-

Following discussions, this issue ended up covering the following:
 * Additional type category predicates
 * Convenience predicates accepting a `DataType` parameter
 * Documentation and testing of the predicates
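
A sketch of the difference between a category exposed as a vector and a
category predicate, assuming a toy type enumeration. `ToyType`, `ToyIntTypes`,
`IsIntByScan`, `IsToyInt`, and `ToyDataType` are hypothetical names chosen for
illustration; the predicates actually added to Arrow may be named and shaped
differently.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical toy type ids; not Arrow's Type::type.
enum class ToyType { kBool, kInt32, kInt64, kFloat64, kString };

// Category exposed as a vector of its member types.
const std::vector<ToyType>& ToyIntTypes() {
  static const std::vector<ToyType> types = {ToyType::kInt32,
                                             ToyType::kInt64};
  return types;
}

// Membership test by scanning the vector: linear in the category size.
bool IsIntByScan(ToyType type) {
  const auto& types = ToyIntTypes();
  return std::find(types.begin(), types.end(), type) != types.end();
}

// Membership test as a predicate: a switch that is usable in constexpr.
constexpr bool IsToyInt(ToyType type) {
  switch (type) {
    case ToyType::kInt32:
    case ToyType::kInt64:
      return true;
    default:
      return false;
  }
}

// Convenience overload taking an object that carries a type id.
struct ToyDataType {
  ToyType id;
};
constexpr bool IsToyInt(const ToyDataType& type) { return IsToyInt(type.id); }
```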

  was:
Currently, type categories are only available as vectors, e.g., the category of 
integer types is available via `arrow::IntTypes()`. This issue will add type 
category membership test, e.g. `arrow::IsIntType(type)`.

Following discussions, this issue ended up covering the following:
 *  


> [C++] Add type category membership checks
> -
>
> Key: ARROW-17289
> URL: https://issues.apache.org/jira/browse/ARROW-17289
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> -Currently, type categories are only available as vectors, e.g., the category 
> of integer types is available via `arrow::IntTypes()`. This issue will add 
> type category membership test, e.g. `arrow::IsIntType(type)`.-
> Following discussions, this issue ended up covering the following:
>  * Additional type category predicates
>  * Convenience predicates accepting a `DataType` parameter
>  * Documentation and testing of the predicates





[jira] [Updated] (ARROW-17289) [C++] Add type category membership checks

2022-08-18 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-17289:

Description: 
Currently, type categories are only available as vectors, e.g., the category of 
integer types is available via `arrow::IntTypes()`. This issue will add type 
category membership test, e.g. `arrow::IsIntType(type)`.

Following discussions, this issue ended up covering the following:
 *  

  was:Currently, type categories are only available as vectors, e.g., the 
category of integer types is available via `arrow::IntTypes()`. This issue will 
add type category membership test, e.g. `arrow::IsIntType(type)`.


> [C++] Add type category membership checks
> -
>
> Key: ARROW-17289
> URL: https://issues.apache.org/jira/browse/ARROW-17289
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Currently, type categories are only available as vectors, e.g., the category 
> of integer types is available via `arrow::IntTypes()`. This issue will add 
> type category membership test, e.g. `arrow::IsIntType(type)`.
> Following discussions, this issue ended up covering the following:
>  *  





[jira] [Resolved] (ARROW-17290) [C++] Add order-comparisons for numeric scalars

2022-08-15 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili resolved ARROW-17290.
-
Resolution: Won't Fix

See [this 
post|https://github.com/apache/arrow/pull/13784#issuecomment-1211041269] for 
rationale.

> [C++] Add order-comparisons for numeric scalars
> ---
>
> Key: ARROW-17290
> URL: https://issues.apache.org/jira/browse/ARROW-17290
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Currently, only equal-comparison of scalars is supported, by 
> `EqualComparable`. This issue will add order-comparisons, such as less-than, 
> to numeric scalars.





[jira] [Updated] (ARROW-17412) [C++] AsofJoin multiple keys and types

2022-08-15 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-17412:

Summary: [C++] AsofJoin multiple keys and types  (was: AsofJoin multiple 
keys and types)

> [C++] AsofJoin multiple keys and types
> --
>
> Key: ARROW-17412
> URL: https://issues.apache.org/jira/browse/ARROW-17412
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> Currently, AsofJoin supports a single (column) key of a limited set of types. 
> This issue will extend the support to multiple keys and types.





[jira] [Created] (ARROW-17412) AsofJoin multiple keys and types

2022-08-15 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-17412:
---

 Summary: AsofJoin multiple keys and types
 Key: ARROW-17412
 URL: https://issues.apache.org/jira/browse/ARROW-17412
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili


Currently, AsofJoin supports a single (column) key of a limited set of types. 
This issue will extend the support to multiple keys and types.





[jira] [Comment Edited] (ARROW-17290) [C++] Add order-comparisons for numeric scalars

2022-08-09 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577629#comment-17577629
 ] 

Yaron Gvili edited comment on ARROW-17290 at 8/9/22 9:16 PM:
-

{quote}I'm curious, what is the use case?
{quote}
See [this 
post|https://github.com/apache/arrow/pull/13784#issuecomment-1209861142].


was (Author: JIRAUSER284707):
{quote}I'm curious, what is the use case?
{quote}
See [this post|http://example.com].

> [C++] Add order-comparisons for numeric scalars
> ---
>
> Key: ARROW-17290
> URL: https://issues.apache.org/jira/browse/ARROW-17290
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently, only equal-comparison of scalars is supported, by 
> `EqualComparable`. This issue will add order-comparisons, such as less-than, 
> to numeric scalars.





[jira] [Commented] (ARROW-17290) [C++] Add order-comparisons for numeric scalars

2022-08-09 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577629#comment-17577629
 ] 

Yaron Gvili commented on ARROW-17290:
-

{quote}I'm curious, what is the use case?
{quote}
See [this post|http://example.com].

> [C++] Add order-comparisons for numeric scalars
> ---
>
> Key: ARROW-17290
> URL: https://issues.apache.org/jira/browse/ARROW-17290
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently, only equal-comparison of scalars is supported, by 
> `EqualComparable`. This issue will add order-comparisons, such as less-than, 
> to numeric scalars.





[jira] [Commented] (ARROW-17289) [C++] Add type category membership checks

2022-08-03 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574681#comment-17574681
 ] 

Yaron Gvili commented on ARROW-17289:
-

I'll adjust my proposed PR.

Regarding docs, a possible way to promote the cookbook is to add a link to it 
from many Arrow C++ public headers that would show up in generated reference 
pages.

> [C++] Add type category membership checks
> -
>
> Key: ARROW-17289
> URL: https://issues.apache.org/jira/browse/ARROW-17289
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, type categories are only available as vectors, e.g., the category 
> of integer types is available via `arrow::IntTypes()`. This issue will add 
> type category membership test, e.g. `arrow::IsIntType(type)`.





[jira] [Commented] (ARROW-17289) [C++] Add type category membership checks

2022-08-02 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574421#comment-17574421
 ] 

Yaron Gvili commented on ARROW-17289:
-

This is a small improvement proposal for convenience. Given a 
`std::shared_ptr<DataType> type`, it's easy for the user to find 
`IsIntegerType(type)`. IIUC, using `type_traits.h` the code would be 
`is_integer(type->id())`, which has one more function that the user needs to 
find. Granted, I overlooked that the proposed implementation could be 
simplified using the type id.

Regarding documentation, when I search for "apache arrow type is integer", the 
top results are [https://arrow.apache.org/docs/r/reference/data-type.html], 
[https://arrow.apache.org/docs/r/reference/data-type.html], followed by PyArrow 
results. Adding "C++" to the search doesn't change the results much.
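
The trade-off discussed above can be sketched in isolation. The following is a minimal, self-contained model and not Arrow's code: `TypeId`, `DataType`, `IsIntegerId`, `IsIntegerType`, and `IsIntegerByVector` are hypothetical stand-ins, chosen only to contrast a predicate on the type id, a convenience overload that accepts the type itself, and a membership test that searches a category vector.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not Arrow's types.
enum class TypeId { kInt8, kInt16, kInt32, kInt64,
                    kUInt8, kUInt16, kUInt32, kUInt64,
                    kFloat, kDouble, kString };

class DataType {
 public:
  explicit DataType(TypeId id) : id_(id) {}
  TypeId id() const { return id_; }

 private:
  TypeId id_;
};

// Predicate on the type id: a plain switch, no allocation or search.
inline bool IsIntegerId(TypeId id) {
  switch (id) {
    case TypeId::kInt8:
    case TypeId::kInt16:
    case TypeId::kInt32:
    case TypeId::kInt64:
    case TypeId::kUInt8:
    case TypeId::kUInt16:
    case TypeId::kUInt32:
    case TypeId::kUInt64:
      return true;
    default:
      return false;
  }
}

// Convenience overloads: the caller passes the type and never touches the id.
inline bool IsIntegerType(const DataType& type) { return IsIntegerId(type.id()); }
inline bool IsIntegerType(const std::shared_ptr<DataType>& type) {
  return type != nullptr && IsIntegerId(type->id());
}

// Category exposed as a vector, with membership tested by linear search.
inline const std::vector<TypeId>& IntegerTypeIds() {
  static const std::vector<TypeId> ids = {
      TypeId::kInt8,  TypeId::kInt16,  TypeId::kInt32,  TypeId::kInt64,
      TypeId::kUInt8, TypeId::kUInt16, TypeId::kUInt32, TypeId::kUInt64};
  return ids;
}
inline bool IsIntegerByVector(TypeId id) {
  const std::vector<TypeId>& ids = IntegerTypeIds();
  return std::find(ids.begin(), ids.end(), id) != ids.end();
}
```

In this model the convenience overload is a one-line forwarder to the id-based predicate, which is the simplification acknowledged above; the vector-based test gives the same answer but does a search on every call.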

> [C++] Add type category membership checks
> -
>
> Key: ARROW-17289
> URL: https://issues.apache.org/jira/browse/ARROW-17289
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, type categories are only available as vectors, e.g., the category 
> of integer types is available via `arrow::IntTypes()`. This issue will add 
> type category membership test, e.g. `arrow::IsIntType(type)`.





[jira] [Created] (ARROW-17290) [C++] Add order-comparisons for numeric scalars

2022-08-02 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-17290:
---

 Summary: [C++] Add order-comparisons for numeric scalars
 Key: ARROW-17290
 URL: https://issues.apache.org/jira/browse/ARROW-17290
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili


Currently, only equal-comparison of scalars is supported, by 
`EqualComparable`. This issue will add order-comparisons, such as less-than, to 
numeric scalars.





[jira] [Created] (ARROW-17289) [C++] Add type category membership checks

2022-08-02 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-17289:
---

 Summary: [C++] Add type category membership checks
 Key: ARROW-17289
 URL: https://issues.apache.org/jira/browse/ARROW-17289
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili


Currently, type categories are only available as vectors, e.g., the category of 
integer types is available via `arrow::IntTypes()`. This issue will add type 
category membership test, e.g. `arrow::IsIntType(type)`.





[jira] [Assigned] (ARROW-16968) [C++] Expand Python-UDF support to Arrow Substrait

2022-07-02 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili reassigned ARROW-16968:
---

Assignee: Yaron Gvili

> [C++] Expand Python-UDF support to Arrow Substrait
> --
>
> Key: ARROW-16968
> URL: https://issues.apache.org/jira/browse/ARROW-16968
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> Currently, Python-UDFs are supported in Arrow at a low level of abstraction. 
> This issue is about expanding this support to Arrow Substrait, which is at a 
> higher level. This is intended for the Py/Arrow support of the use case in 
> which a Substrait plan with Python UDFs is manually expressed using Ibis, 
> automatically represented using Substrait, and automatically executed using 
> Py/Arrow.





[jira] [Created] (ARROW-16968) [C++] Expand Python-UDF support to Arrow Substrait

2022-07-02 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-16968:
---

 Summary: [C++] Expand Python-UDF support to Arrow Substrait
 Key: ARROW-16968
 URL: https://issues.apache.org/jira/browse/ARROW-16968
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yaron Gvili


Currently, Python-UDFs are supported in Arrow at a low level of abstraction. 
This issue is about expanding this support to Arrow Substrait, which is at a 
higher level. This is intended for the Py/Arrow support of the use case in 
which a Substrait plan with Python UDFs is manually expressed using Ibis, 
automatically represented using Substrait, and automatically executed using 
Py/Arrow.





[jira] [Commented] (ARROW-16770) [C++] Arrow Substrait test fails with SIGSEGV, possibly due to gtest 1.11.0

2022-06-16 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555283#comment-17555283
 ] 

Yaron Gvili commented on ARROW-16770:
-

To make sure we're on the same page, my interpretation of this issue is that it 
shows there exists a non-contrived setup (the one I happen to have and have 
worked with for some time now) where a normal Arrow test invocation leads to a 
not-so-trivial SIGSEGV. My intentions in filing this issue are to save time for 
whoever may run into this in the future and to try to find someone who knows 
how to fix it.

[~lidavidm], [~westonpace]: yes, as noted in the description, there is a mixup 
in my setup of GTest 1.10 and 1.11 (I think one was installed by pyarrow-dev 
and another by Ubuntu's apt), yet the point is I didn't do anything contrived to 
reach this setup, so it could happen to others.

Since this issue is currently not a blocker for me and I'm occupied with a 
large project, I'm fine with this staying on hold for the time being.

> [C++] Arrow Substrait test fails with SIGSEGV, possibly due to gtest 1.11.0
> ---
>
> Key: ARROW-16770
> URL: https://issues.apache.org/jira/browse/ARROW-16770
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Priority: Major
>
> I built Arrow using the instructions in the Python development page, under 
> the pyarrow-dev environment, and found that `arrow-substrait-substrait-test` 
> fails with SIGSEGV - see gdb session below. The same Arrow builds and runs 
> correctly on my system, outside of pyarrow-dev. I suspect this is due to 
> something different about gtest 1.11.0 as compared to gtest 1.10.0 based on 
> the following observations:
>  # The backtrace in the gdb session shows gtest 1.11.0 is used.
>  # The backtrace also shows the error is deep inside gtest, working on an 
> `UnorderedElementsAre` expectation.
>  # My system, outside pyarrow-dev, uses gtest 1.10.0.
>  
> {noformat}
> $ gdb --args ./release/arrow-substrait-substrait-test 
> GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
> Copyright (C) 2020 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> Find the GDB manual and other documentation resources online at:
>     <http://www.gnu.org/software/gdb/documentation/>.
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from ./release/arrow-substrait-substrait-test...
> (gdb) run
> Starting program: 
> /mnt/user1/tscontract/github/rtpsw/arrow/cpp/build/debug/release/arrow-substrait-substrait-test
>  
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> [New Thread 0x741ff700 (LWP 115128)]
> Running main() from 
> /home/conda/feedstock_root/build_artifacts/gtest_1647154636757/work/googletest/src/gtest_main.cc
> [==] Running 33 tests from 3 test suites.
> [--] Global test environment set-up.
> [--] 4 tests from ExtensionIdRegistryTest
> [ RUN      ] ExtensionIdRegistryTest.RegisterTempTypes
> [       OK ] ExtensionIdRegistryTest.RegisterTempTypes (0 ms)
> [ RUN      ] ExtensionIdRegistryTest.RegisterTempFunctions
> [       OK ] ExtensionIdRegistryTest.RegisterTempFunctions (0 ms)
> [ RUN      ] ExtensionIdRegistryTest.RegisterNestedTypes
> [       OK ] ExtensionIdRegistryTest.RegisterNestedTypes (0 ms)
> [ RUN      ] ExtensionIdRegistryTest.RegisterNestedFunctions
> [       OK ] ExtensionIdRegistryTest.RegisterNestedFunctions (0 ms)
> [--] 4 tests from ExtensionIdRegistryTest (0 ms total)
> [--] 21 tests from Substrait
> [ RUN      ] Substrait.SupportedTypes
> [       OK ] Substrait.SupportedTypes (0 ms)
> [ RUN      ] Substrait.SupportedExtensionTypes
> [       OK ] Substrait.SupportedExtensionTypes (0 ms)
> [ RUN      ] Substrait.NamedStruct
> [       OK ] Substrait.NamedStruct (0 ms)
> [ RUN      ] Substrait.NoEquivalentArrowType
> [       OK ] Substrait.NoEquivalentArrowType (0 ms)
> [ RUN      ] Substrait.NoEquivalentSubstraitType
> [       OK ] Substrait.NoEquivalentSubstraitType (0 ms)
> [ RUN      ] Substrait.SupportedLiterals
> [       OK ] Substrait.SupportedLiterals (1 ms)
> [ RUN      ] Substrait.CannotDeserializeLiteral
> [       OK ] Substrait.CannotDeserializeLiteral (0 ms)
> [ RUN      ] Substrait.FieldRefRoundTrip
> [       OK ] 

[jira] [Comment Edited] (ARROW-16823) [C++] Arrow Substrait enhancements for UDF

2022-06-16 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554982#comment-17554982
 ] 

Yaron Gvili edited comment on ARROW-16823 at 6/16/22 9:26 AM:
--

[~vibhatha], before I address your points, I think it would help that I write 
my view of how nested registries would be used, in general and in the context 
of UDFs.

In general, a nested registry is created and passed to a new scope which is 
free to modify it without affecting its parent registries. This can be thought 
of as passing-by-value, as long as parent registries remain constant while the 
new scope is alive, and indeed this is the recommended way of using nested 
registries. With this way of use, registry nesting has the following desirable 
properties:
 # Value-semantics: modifications are restricted to the passed "value".
 # Recursive: repeated nesting works as expected.
 # Thread-safety: a nested registry can be safely passed to a thread.

In the context of UDFs, a nested registry is created for temporarily 
registering UDFs for the lifetime of a separate scope in which they will be 
used. In a typical use case, this scope is for deserialization and execution of 
a Substrait plan. In this use case, one creates nested (function and 
extension-id) registries and uses them to deserialize a Substrait plan, 
register UDFs for this plan, and execute the plan, then drops the nested 
registries.

It is no accident that the above properties make nested registries powerful 
enough to cleanly support much more complex future use cases. I envision 
modular Substrait plans:
 * a Substrait plan can be shared (from author to its users)
 * shared Substrait plans can be gathered in libraries/modules
 * a Substrait plan can include invocations of other shared Substrait plans

and that they will become important for boosting user productivity with Arrow.

While this is my long-term vision, the current issue is about preparation for 
upcoming end-to-end Ibis/Ibis-Substrait/PyArrow support for Python-UDFs that 
I'm currently working on.

Now to your points.

> I think for general usage of UDFs we could also keep a temporary registry 
> which is in the scope of the application and it get destroyed when the 
> application ends it's life.

A single registry for UDFs would go against the design goal of modularity. It 
would require support for unregistration, which is error-prone. See also the 
discussion in ARROW-16211.

> Thinking about a simple example to reflect the usage.

This is actually an example more complex than the 
single-Substrait-plan-with-UDFs one that I described above.

> Visually GFR->TF1, GFR->TF2 or GFR->TF1->TF2 right?

I think the right organization for your example is that each nested registry 
has the global one as its parent. Each of the 3 stages has its own set of UDFs 
to register.

> What if TF1 destroyed, that means TF2 get detached from the GFR, are we going 
> to correct that relationship when we remove TF1. Are we planning to handle 
> this or is this irrelevant?

When following the recommended way of using nested registries that I described 
above, even in a case of repeated nesting like GFR->TF1->TF2, it is incorrect 
to even modify, let alone drop, TF1 while TF2 is alive.

> Considering the practical usage, I assume what should happen is, when I ask 
> for function `f1` to be called, it should scan through the global, then go 
> level by level on the scoped and retrieve the function once located. Is this 
> right?

It's the other way around. In the case of GFR->TF1->TF2, the function is first 
looked up in TF2, then in TF1, and finally in GFR. This way, modifications to 
TF2 take precedence, which is what one expects from value-semantics.
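
The lookup order just described can be sketched with a small model. This is an assumption-laden illustration and not Arrow's `FunctionRegistry`: the `Registry` class, its `Add` and `Get` methods, and the use of `std::function<int(int)>` as the function type are all hypothetical, chosen only to show innermost-first lookup along a parent chain.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical model of a nested registry; not Arrow's FunctionRegistry.
class Registry {
 public:
  using Function = std::function<int(int)>;

  // A registry with no parent plays the role of the global one (GFR).
  explicit Registry(const Registry* parent = nullptr) : parent_(parent) {}

  // Registers into this registry only; parents are never modified.
  void Add(const std::string& name, Function fn) {
    functions_[name] = std::move(fn);
  }

  // Looks the name up in this registry first, then walks up the parent chain.
  // Returns nullptr when no registry in the chain has the name.
  const Function* Get(const std::string& name) const {
    for (const Registry* r = this; r != nullptr; r = r->parent_) {
      auto it = r->functions_.find(name);
      if (it != r->functions_.end()) return &it->second;
    }
    return nullptr;
  }

 private:
  const Registry* parent_;  // not owned; must outlive this registry
  std::map<std::string, Function> functions_;
};
```

With a chain built as `Registry gfr; Registry tf1(&gfr); Registry tf2(&tf1);`, a function registered under the same name in both `gfr` and `tf2` resolves to the `tf2` version when looked up through `tf2`, while lookups through `tf1` or `gfr` still see the original. The parent pointer is `const`, which mirrors the recommendation that parents stay unmodified while a nested registry is alive.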

>  For Python UDF users or R UDF users, do we have to do anything special where 
> we expose the FunctionRegistry (I guess we don't have to, but curious)...

Eventually, the end-user should typically just invoke a single function to 
execute a Substrait plan. If the Substrait plan has UDFs, their registration 
into fresh nested registries will be automated (I have this locally worked out 
for Python-UDFs, and presumably R-UDFs should work out similarly). The 
facilities we discuss here are for developers and should eventually be 
encapsulated from the end-user.

> In addition, I have this general question, depending on the usage, should we 
> keep a separate temporary function registry for Substrait UDF users, plain 
> UDF users (directly using Arrow), in future there could be similar cases 
> where we need to support...

As described above, the recommended way is to create nested registries for a 
scope, not for a class-of-use (like Substrait-UDF-use and plain-UDF-use).

> Diving a little deep into the parallel case, we are going to have separate 
> scoped registry for each instance. I would say that is efficient for 
> communication and there is no sync issues. 

[jira] [Comment Edited] (ARROW-16823) [C++] Arrow Substrait enhancements for UDF

2022-06-16 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554982#comment-17554982
 ] 

Yaron Gvili edited comment on ARROW-16823 at 6/16/22 9:21 AM:
--

[~vibhatha], before I address your points, I think it would help that I write 
my view of how nested registries would be used, in general and in the context 
of UDFs.

In general, a nested registry is created and passed to a new scope which is 
free to modify it without affecting its parent registries. This can be thought 
of as passing-by-value, as long as parent registries remain constant while the 
new scope is alive, and indeed this is the recommended way of using nested 
registries. With this way of use, registry nesting has the following desirable 
properties:
 # Value-semantics: modifications are restricted to the passed "value".
 # Recursive: repeated nesting works as expected.
 # Thread-safety: a nested registry can be safely passed to a thread.

In the context of UDFs, a nested registry is created for temporarily 
registering UDFs for the lifetime of a separate scope in which they will be 
used. In a typical use case, this scope is for deserialization and execution of 
a Substrait plan. In this use case, one creates nested (function and 
extension-id) registries and uses them to deserialize a Substrait plan, 
register UDFs for this plan, and execute the plan, then drops the nested 
registries.

It is no accident that the above properties make nested registries powerful 
enough to cleanly support much more complex future use cases. I envision 
modular Substrait plans:
 * a Substrait plan can be shared (from author to its users)
 * shared Substrait plans can be gathered in libraries/modules
 * a Substrait plan can include invocations of other shared Substrait plans

and that they will become important for boosting user productivity with Arrow.

While this is my long-term vision, the current issue is about preparation for 
upcoming end-to-end Ibis/Ibis-Substrait/PyArrow support for Python-UDFs that 
I'm currently working on.

Now to your points.

> I think for general usage of UDFs we could also keep a temporary registry 
> which is in the scope of the application and it get destroyed when the 
> application ends it's life.

A single registry for UDFs would go against the design goal of modularity. It 
would require support for unregistration, which is error-prone. See also the 
discussion in ARROW-16211.

> Thinking about a simple example to reflect the usage.

This is actually an example more complex than the 
single-Substrait-plan-with-UDFs one that I described above.

> Visually GFR->TF1, GFR->TF2 or GFR->TF1->TF2 right?

I think the right organization for your example is that each nested registry 
has the global one as its parent. Each of the 3 stages has its own set of UDFs 
to register.

> What if TF1 destroyed, that means TF2 get detached from the GFR, are we going 
> to correct that relationship when we remove TF1. Are we planning to handle 
> this or is this irrelevant?

When following the recommended way of using nested registries that I described 
above, even in a case of repeated nesting like GFR->TF1->TF2, it is incorrect 
to even modify, let alone drop, TF1 while TF2 is alive.

> Considering the practical usage, I assume what should happen is, when I ask 
> for function `f1` to be called, it should scan through the global, then go 
> level by level on the scoped and retrieve the function once located. Is this 
> right?

It's the other way around. In the case of GFR->TF1->TF2, the function is first 
looked up in TF2, then in TF1, and finally in GFR. This way, modifications to 
TF2 take precedence, which is what one expects from value-semantics.

>  For Python UDF users or R UDF users, do we have to do anything special where 
> we expose the FunctionRegistry (I guess we don't have to, but curious)...

Eventually, the end-user should typically just invoke a single function to 
execute a Substrait plan. If the Substrait plan has UDFs, their registration 
into fresh nested registries will be automated (I have this locally worked out 
for Python-UDFs). The facilities we discuss here are for developers and should 
eventually be encapsulated from the end-user.

> In addition, I have this general question, depending on the usage, should we 
> keep a separate temporary function registry for Substrait UDF users, plain 
> UDF users (directly using Arrow), in future there could be similar cases 
> where we need to support...

As described above, the recommended way is to create nested registries for a 
scope, not for a class-of-use (like Substrait-UDF-use and plain-UDF-use).

> Diving a little deep into the parallel case, we are going to have separate 
> scoped registry for each instance. I would say that is efficient for 
> communication and there is no sync issues. May be the intended use is 
> multiple plans with 

[jira] [Commented] (ARROW-16823) [C++] Arrow Substrait enhancements for UDF

2022-06-16 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554982#comment-17554982
 ] 

Yaron Gvili commented on ARROW-16823:
-

[~vibhatha], before I address your points, I think it would help that I write 
my view of how nested registries would be used, in general and in the context 
of UDFs.

In general, a nested registry is created and passed to a new scope which is 
free to modify it without affecting its parent registries. This can be thought 
of as passing-by-value, as long as parent registries remain constant while the 
new scope is alive, and indeed this is the recommended way of using nested 
registries. With this way of use, registry nesting has the following desirable 
properties:
 # Value-semantics: modifications are restricted to the passed "value".
 # Recursive: repeated nesting works as expected.
 # Thread-safety: a nested registry can be safely passed to a thread.

In the context of UDFs, a nested registry is created for temporarily 
registering UDFs for the lifetime of a separate scope in which they will be 
used. In a typical use case, this scope is for deserialization and execution of 
a Substrait plan. In this use case, one creates nested (function and 
extension-id) registries and uses them to deserialize a Substrait plan, register 
UDFs for this plan, and execute the plan, then drops the nested registries.

It is no accident that the above properties make nested registries powerful 
enough to cleanly support much more complex future use cases. I envision 
modular Substrait plans:
 * a Substrait plan can be shared (from author to its users)
 * shared Substrait plans can be gathered in libraries/modules
 * a Substrait plan can include invocations of other shared Substrait plans

and that they will become important for boosting user productivity with Arrow.

While this is my long-term vision, the current issue is about preparation for 
upcoming end-to-end Ibis/Ibis-Substrait/PyArrow support for Python-UDFs that 
I'm currently working on.

Now to your points.

> I think for general usage of UDFs we could also keep a temporary registry 
> which is in the scope of the application and it get destroyed when the 
> application ends it's life.

A single registry for UDFs would go against the design goal of modularity. It 
would require support for unregistration, which is error-prone. See also the 
discussion in ARROW-16211.

> Thinking about a simple example to reflect the usage.

This is actually an example more complex than the 
single-Substrait-plan-with-UDFs one that I described above.

> Visually GFR->TF1, GFR->TF2 or GFR->TF1->TF2 right?

I think the right organization for your example is that each nested registry 
has the global one as its parent. Each of the 3 stages has its own set of UDFs 
to register.

> What if TF1 destroyed, that means TF2 get detached from the GFR, are we going 
> to correct that relationship when we remove TF1. Are we planning to handle 
> this or is this irrelevant?

When following the recommended way of using nested registries that I described 
above, even in a case of repeated nesting like GFR->TF1->TF2, it is incorrect 
to even modify, let alone drop, TF1 while TF2 is alive.

> Considering the practical usage, I assume what should happen is, when I ask 
> for function `f1` to be called, it should scan through the global, then go 
> level by level on the scoped and retrieve the function once located. Is this 
> right?

It's the other way around. In the case of GFR->TF1->TF2, the function is first 
looked up in TF2, then in TF1, and finally in GFR. This way, modifications to 
TF2 take precedence, which is what one expects from value-semantics.

>  For Python UDF users or R UDF users, do we have to do anything special where 
> we expose the FunctionRegistry (I guess we don't have to, but curious)...

Eventually, the end-user should typically just invoke a single function to 
execute a Substrait plan. If the Substrait plan has UDFs, their registration 
into fresh nested registries will be automated (I have this locally worked out 
for Python-UDFs). The facilities we discuss here are for developers and should 
eventually be encapsulated from the end-user.

> In addition, I have this general question, depending on the usage, should we 
> keep a separate temporary function registry for Substrait UDF users, plain 
> UDF users (directly using Arrow), in future there could be similar cases 
> where we need to support...

As described above, the recommended way is to create nested registries for a 
scope, not for a class-of-use (like Substrait-UDF-use and plain-UDF-use).

> Diving a little deep into the parallel case, we are going to have separate 
> scoped registry for each instance. I would say that is efficient for 
> communication and there is no sync issues. May be the intended use is 
> multiple plans with non-overlapping functions? ...

A thread is a 

[jira] [Commented] (ARROW-16823) [C++] Arrow Substrait enhancements for UDF

2022-06-15 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554780#comment-17554780
 ] 

Yaron Gvili commented on ARROW-16823:
-

Some design rationale:
 * The scoped (or nested) registries are intended for temporary registration of 
functions. A typical use case for this is execution of a Substrait plan that 
includes UDFs (TBD). The UDFs get registered using a scoped 
extension-id-registry and a scoped function-registry, which are used during 
plan deserialization and execution, and thereafter can be dropped without ever 
affecting the default/global registries. This can even be done for multiple 
plans in parallel, each using separate scoped registries.
 * The registration of external functions is intended for UDFs provided outside 
of the Substrait plan they are used in. This is one way to plug in UDFs. 
Another way is by embedding UDFs within the plan (TBD).

> [C++] Arrow Substrait enhancements for UDF
> --
>
> Key: ARROW-16823
> URL: https://issues.apache.org/jira/browse/ARROW-16823
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> The enhancements include support for:
>  * user-provided extension-id-registries and function-registries (for scoped 
> registries)
>  * registering a function (with an Id) external to the plan
>  * a dataset-write-sink (for convenience and multiple outputting)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16823) [C++] Arrow Substrait enhancements for UDF

2022-06-14 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-16823:

Description: 
The enhancements include support for:
 * user-provided extension-id-registries and function-registries (for scoped 
registries)
 * registering a function (with an Id) external to the plan
 * a dataset-write-sink (for convenience and multiple outputting)

  was:
The enhancements include support for:
 * user-provided extension-id-registries (for scoped registries)
 * registering a function (with an Id) external to the plan
 * a dataset-write-sink (for convenience and multiple outputting)


> [C++] Arrow Substrait enhancements for UDF
> --
>
> Key: ARROW-16823
> URL: https://issues.apache.org/jira/browse/ARROW-16823
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The enhancements include support for:
>  * user-provided extension-id-registries and function-registries (for scoped 
> registries)
>  * registering a function (with an Id) external to the plan
>  * a dataset-write-sink (for convenience and multiple outputting)





[jira] [Updated] (ARROW-16823) [C++] Arrow Substrait enhancements for UDF

2022-06-14 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-16823:

Description: 
The enhancements include support for:
 * user-provided extension-id-registries (for scoped registries)
 * registering a function (with an Id) external to the plan
 * a dataset-write-sink (for convenience and multiple outputting)

  was:
The enhancements include support for:
 * user-provided extension-id-registries (for scoped registries)
 * registering a function (with an Id) external to the plan
 * a dataset-write-sink (for convenience)


> [C++] Arrow Substrait enhancements for UDF
> --
>
> Key: ARROW-16823
> URL: https://issues.apache.org/jira/browse/ARROW-16823
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The enhancements include support for:
>  * user-provided extension-id-registries (for scoped registries)
>  * registering a function (with an Id) external to the plan
>  * a dataset-write-sink (for convenience and multiple outputting)





[jira] [Created] (ARROW-16823) [C++] Arrow Substrait enhancements for UDF

2022-06-13 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-16823:
---

 Summary: [C++] Arrow Substrait enhancements for UDF
 Key: ARROW-16823
 URL: https://issues.apache.org/jira/browse/ARROW-16823
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yaron Gvili


The enhancements include support for:
 * user-provided extension-id-registries (for scoped registries)
 * registering a function (with an Id) external to the plan
 * a dataset-write-sink (for convenience)





[jira] [Assigned] (ARROW-16823) [C++] Arrow Substrait enhancements for UDF

2022-06-13 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili reassigned ARROW-16823:
---

Assignee: Yaron Gvili

> [C++] Arrow Substrait enhancements for UDF
> --
>
> Key: ARROW-16823
> URL: https://issues.apache.org/jira/browse/ARROW-16823
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> The enhancements include support for:
>  * user-provided extension-id-registries (for scoped registries)
>  * registering a function (with an Id) external to the plan
>  * a dataset-write-sink (for convenience)





[jira] [Commented] (ARROW-16770) [C++] Arrow Substrait test fails with SIGSEGV, possibly due to gtest 1.11.0

2022-06-12 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553274#comment-17553274
 ] 

Yaron Gvili commented on ARROW-16770:
-

[~lidavidm], making sure this is noticed. I'm not sure who would be a good 
person to take this.

> [C++] Arrow Substrait test fails with SIGSEGV, possibly due to gtest 1.11.0
> ---
>
> Key: ARROW-16770
> URL: https://issues.apache.org/jira/browse/ARROW-16770
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Priority: Major
>
> I built Arrow using the instructions in the Python development page, under 
> the pyarrow-dev environment, and found that `arrow-substrait-substrait-test` 
> fails with SIGSEGV - see gdb session below. The same Arrow builds and runs 
> correctly on my system, outside of pyarrow-dev. I suspect this is due to 
> something different about gtest 1.11.0 as compared to gtest 1.10.0 based on 
> the following observations:
>  # The backtrace in the gdb session shows gtest 1.11.0 is used.
>  # The backtrace also shows the error is deep inside gtest, working on an 
> `UnorderedElementsAre` expectation.
>  # My system, outside pyarrow-dev, uses gtest 1.10.0.
>  
> {noformat}
> $ gdb --args ./release/arrow-substrait-substrait-test 
> GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
> Copyright (C) 2020 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> .
> Find the GDB manual and other documentation resources online at:
>     .
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from ./release/arrow-substrait-substrait-test...
> (gdb) run
> Starting program: 
> /mnt/user1/tscontract/github/rtpsw/arrow/cpp/build/debug/release/arrow-substrait-substrait-test
>  
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> [New Thread 0x741ff700 (LWP 115128)]
> Running main() from 
> /home/conda/feedstock_root/build_artifacts/gtest_1647154636757/work/googletest/src/gtest_main.cc
> [==] Running 33 tests from 3 test suites.
> [--] Global test environment set-up.
> [--] 4 tests from ExtensionIdRegistryTest
> [ RUN      ] ExtensionIdRegistryTest.RegisterTempTypes
> [       OK ] ExtensionIdRegistryTest.RegisterTempTypes (0 ms)
> [ RUN      ] ExtensionIdRegistryTest.RegisterTempFunctions
> [       OK ] ExtensionIdRegistryTest.RegisterTempFunctions (0 ms)
> [ RUN      ] ExtensionIdRegistryTest.RegisterNestedTypes
> [       OK ] ExtensionIdRegistryTest.RegisterNestedTypes (0 ms)
> [ RUN      ] ExtensionIdRegistryTest.RegisterNestedFunctions
> [       OK ] ExtensionIdRegistryTest.RegisterNestedFunctions (0 ms)
> [--] 4 tests from ExtensionIdRegistryTest (0 ms total)
> [--] 21 tests from Substrait
> [ RUN      ] Substrait.SupportedTypes
> [       OK ] Substrait.SupportedTypes (0 ms)
> [ RUN      ] Substrait.SupportedExtensionTypes
> [       OK ] Substrait.SupportedExtensionTypes (0 ms)
> [ RUN      ] Substrait.NamedStruct
> [       OK ] Substrait.NamedStruct (0 ms)
> [ RUN      ] Substrait.NoEquivalentArrowType
> [       OK ] Substrait.NoEquivalentArrowType (0 ms)
> [ RUN      ] Substrait.NoEquivalentSubstraitType
> [       OK ] Substrait.NoEquivalentSubstraitType (0 ms)
> [ RUN      ] Substrait.SupportedLiterals
> [       OK ] Substrait.SupportedLiterals (1 ms)
> [ RUN      ] Substrait.CannotDeserializeLiteral
> [       OK ] Substrait.CannotDeserializeLiteral (0 ms)
> [ RUN      ] Substrait.FieldRefRoundTrip
> [       OK ] Substrait.FieldRefRoundTrip (1 ms)
> [ RUN      ] Substrait.RecursiveFieldRef
> [       OK ] Substrait.RecursiveFieldRef (0 ms)
> [ RUN      ] Substrait.FieldRefsInExpressions
> [       OK ] Substrait.FieldRefsInExpressions (0 ms)
> [ RUN      ] Substrait.CallSpecialCaseRoundTrip
> [       OK ] Substrait.CallSpecialCaseRoundTrip (0 ms)
> [ RUN      ] Substrait.CallExtensionFunction
> [       OK ] Substrait.CallExtensionFunction (0 ms)
> [ RUN      ] Substrait.ReadRel
> Thread 1 "arrow-substrait" received signal SIGSEGV, Segmentation fault.
> 0x555b02e6 in 
> testing::internal::MatcherBase std::char_traits, std::allocator > const&>::MatchAndExplain 
> (listener=0x7fffb3a0, x=..., 
>     this=) at 
> 

[jira] [Commented] (ARROW-16811) [C++] Remove default exec context from Expression::Bind

2022-06-10 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552993#comment-17552993
 ] 

Yaron Gvili commented on ARROW-16811:
-

You are referring to this post. This limited-bind is not ideal, though it can 
be useful as an intermediate solution in places in the code that cannot be 
easily changed to work with a non-default ExecContext. I imagine this could 
be the case in some user-facing APIs that currently do not take an ExecContext 
and eventually default to the global function registry (perhaps examples exist 
in the dataset package?). In such cases, there are two options to consider: 
either break user code to force it to provide an ExecContext, or keep user code 
intact but fail at runtime when an expression gets bound in a non-safe way. The 
latter one is what I wanted to draw attention to.
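
A minimal sketch of the two bind variants, assuming a toy expression model. 
`ToyExpr`, `ToyContext`, `Bind` and `BindWithoutCalls` are hypothetical names 
for illustration and are not Arrow's Expression or ExecContext APIs.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy types; these are not Arrow's Expression or ExecContext.
struct ToyExpr {
  enum Kind { kLiteral, kFieldRef, kCall };
  Kind kind;
  std::string name;           // field name, or function name for kCall
  std::vector<ToyExpr> args;  // arguments, only meaningful for kCall
};

struct ToyContext {
  std::set<std::string> functions;  // function names this context can resolve
};

// Full bind: every call, at any depth, must resolve in the supplied context.
bool Bind(const ToyExpr& expr, const ToyContext& ctx) {
  if (expr.kind != ToyExpr::kCall) return true;
  if (ctx.functions.count(expr.name) == 0) return false;
  for (const ToyExpr& arg : expr.args) {
    if (!Bind(arg, ctx)) return false;
  }
  return true;
}

// Limited bind: takes no context, so it rejects any call instead of silently
// falling back to a global registry. Only calls carry arguments in this toy
// model, so checking the top-level kind is sufficient.
bool BindWithoutCalls(const ToyExpr& expr) {
  return expr.kind != ToyExpr::kCall;
}

// Usage sketch: user code that never passes a context keeps working for
// call-free expressions and fails at runtime once a call shows up.
void CheckBindVariants() {
  ToyExpr field{ToyExpr::kFieldRef, "x", {}};
  ToyExpr call{ToyExpr::kCall, "my_udf", {field}};
  ToyContext ctx;
  ctx.functions.insert("my_udf");
  assert(BindWithoutCalls(field));
  assert(!BindWithoutCalls(call));
  assert(Bind(call, ctx));
  assert(!Bind(call, ToyContext{}));
}
```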

> [C++] Remove default exec context from Expression::Bind
> ---
>
> Key: ARROW-16811
> URL: https://issues.apache.org/jira/browse/ARROW-16811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This came up in https://github.com/apache/arrow/pull/13355.
> It is maybe not very intuitive that Expression::Bind would require an 
> ExecContext and so we never provided one.  However, when binding expressions 
> we need to lookup kernels, and that requires a function registry.  Defaulting 
> to default_exec_context is something that should be done at a higher level 
> and so we should not allow ExecContext to be omitted when calling Bind.
> Furthermore, [~rtpsw] has suggested that we might want to split 
> Expression::Bind into two variants.  One which requires an ExecContext and 
> one which does not (but fails if it encounters a "call").





[jira] [Commented] (ARROW-16796) [C++] Fix bad defaulting of ExecContext argument

2022-06-10 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552569#comment-17552569
 ] 

Yaron Gvili commented on ARROW-16796:
-

If coding safety is a major concern here (IMHO it is), I'd suggest that in the 
longer-term Arrow code should distinguish between simplification of expressions 
with and without functions/execution, where only the former requires an 
ExecContext whereas only the latter will fail if a function exists in the 
expression. Perhaps the simplest, though likely not ideal, code change for this 
is to default ExecContext to an implementation that fails.

The purpose of the PR is just a short-term fix. Follow-up issues can be 
created for what remains.
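
The idea of a default that fails can be sketched as follows. This is a toy 
model, not Arrow code: `ToyLookup`, `FailingLookup` and `Simplify` are 
hypothetical names, and the lookup stands in for whatever an ExecContext would 
provide.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the function lookup that an ExecContext provides.
using ToyLookup = std::function<void(const std::string&)>;

// Default lookup that refuses to resolve anything.
void FailingLookup(const std::string& name) {
  throw std::logic_error("no context supplied; cannot resolve '" + name + "'");
}

// Toy "simplification" over the function names that occur in an expression.
// Returns how many were resolved; function-free input never touches the lookup.
int Simplify(const std::vector<std::string>& calls,
             const ToyLookup& lookup = FailingLookup) {
  int resolved = 0;
  for (const std::string& name : calls) {
    lookup(name);  // throws under the failing default
    ++resolved;
  }
  return resolved;
}

// Usage sketch: omitting the lookup is harmless until a function appears.
void CheckFailingDefault() {
  assert(Simplify({}) == 0);
  bool threw = false;
  try {
    Simplify({"my_udf"});  // a function, but no context was supplied
  } catch (const std::logic_error&) {
    threw = true;
  }
  assert(threw);
  ToyLookup accept_all = [](const std::string&) {};
  assert(Simplify({"my_udf"}, accept_all) == 1);
}
```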

> [C++] Fix bad defaulting of ExecContext argument
> 
>
> Key: ARROW-16796
> URL: https://issues.apache.org/jira/browse/ARROW-16796
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> In several places in Arrow code, invocations of Expression::Bind() default 
> the ExecContext argument. This leads to the default function registry being 
> used in expression manipulations, and this becomes a problem when the user 
> wishes to use a non-default function registry, e.g., when passing one to the 
> ExecContext of an ExecPlan, which is how I discovered this issue. The 
> problematic places I found for such Expression::Bind() invocation are:
>  * cpp/src/arrow/dataset/file_parquet.cc
>  * cpp/src/arrow/dataset/scanner.cc
>  * cpp/src/arrow/compute/exec/project_node.cc
>  * cpp/src/arrow/compute/exec/hash_join_node.cc
>  * cpp/src/arrow/compute/exec/filter_node.cc
> There are also other places in test and benchmark code (grep for 'Bind()').
> Another case of bad defaulting of an ExecContext argument is in 
> Inequality::simplifies_to in cpp/src/compute/exec/expression.cc where a fresh 
> ExecContext is created, instead of being received from the caller, and passed 
> to BindNonRecursive.
> I'd argue that an ExecContext variable should not be allowed to default, 
> except perhaps in the highest-level/user-facing APIs.





[jira] [Commented] (ARROW-16796) [C++] Fix bad defaulting of ExecContext argument

2022-06-10 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552568#comment-17552568
 ] 

Yaron Gvili commented on ARROW-16796:
-

Copying [Weston Pace's 
post|https://github.com/apache/arrow/pull/13355#issuecomment-1151679039]:

Good catch. I wonder if we should remove the default argument to bind entirely 
(it would look something like 
[westonpace@{{{}c9ae1dd{}}}|https://github.com/westonpace/arrow/commit/c9ae1dd6a0857af69e48a95ec76480f4c466791e]
 ). Looks like there are only a few other non-test spots we call bind.
 * In the parquet reader we convert statistics into expressions and bind them 
to the schema. These expressions will only use min/max and it's only really for 
simplification and not execution so we're probably ok.
 * The scanner has a number of methods that create exec plans (this is the 
"lightweight producer" half of the scanner). We could arguably add an 
ExecContext to scan options but I think it would be better to start phasing 
out this half of the scanner in favor of direct use of exec plans instead.

> [C++] Fix bad defaulting of ExecContext argument
> 
>
> Key: ARROW-16796
> URL: https://issues.apache.org/jira/browse/ARROW-16796
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In several places in Arrow code, invocations of Expression::Bind() default 
> the ExecContext argument. This leads to the default function registry being 
> used in expression manipulations, and this becomes a problem when the user 
> wishes to use a non-default function registry, e.g., when passing one to the 
> ExecContext of an ExecPlan, which is how I discovered this issue. The 
> problematic places I found for such Expression::Bind() invocation are:
>  * cpp/src/arrow/dataset/file_parquet.cc
>  * cpp/src/arrow/dataset/scanner.cc
>  * cpp/src/arrow/compute/exec/project_node.cc
>  * cpp/src/arrow/compute/exec/hash_join_node.cc
>  * cpp/src/arrow/compute/exec/filter_node.cc
> There are also other places in test and benchmark code (grep for 'Bind()').
> Another case of bad defaulting of an ExecContext argument is in 
> Inequality::simplifies_to in cpp/src/compute/exec/expression.cc where a fresh 
> ExecContext is created, instead of being received from the caller, and passed 
> to BindNonRecursive.
> I'd argue that an ExecContext variable should not be allowed to default, 
> except perhaps in the highest-level/user-facing APIs.





[jira] [Assigned] (ARROW-16796) [C++] Fix bad defaulting of ExecContext argument

2022-06-09 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili reassigned ARROW-16796:
---

Assignee: Yaron Gvili

> [C++] Fix bad defaulting of ExecContext argument
> 
>
> Key: ARROW-16796
> URL: https://issues.apache.org/jira/browse/ARROW-16796
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> In several places in Arrow code, invocations of Expression::Bind() default 
> the ExecContext argument. This leads to the default function registry being 
> used in expression manipulations, and this becomes a problem when the user 
> wishes to use a non-default function registry, e.g., when passing one to the 
> ExecContext of an ExecPlan, which is how I discovered this issue. The 
> problematic places I found for such Expression::Bind() invocation are:
>  * cpp/src/arrow/dataset/file_parquet.cc
>  * cpp/src/arrow/dataset/scanner.cc
>  * cpp/src/arrow/compute/exec/project_node.cc
>  * cpp/src/arrow/compute/exec/hash_join_node.cc
>  * cpp/src/arrow/compute/exec/filter_node.cc
> There are also other places in test and benchmark code (grep for 'Bind()').
> Another case of bad defaulting of an ExecContext argument is in 
> Inequality::simplifies_to in cpp/src/compute/exec/expression.cc where a fresh 
> ExecContext is created, instead of being received from the caller, and passed 
> to BindNonRecursive.
> I'd argue that an ExecContext variable should not be allowed to default, 
> except perhaps in the highest-level/user-facing APIs.





[jira] [Updated] (ARROW-16796) [C++] Fix bad defaulting of ExecContext argument

2022-06-09 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-16796:

Summary: [C++] Fix bad defaulting of ExecContext argument  (was: [C++] Bad 
defaulting of ExecContext arg)

> [C++] Fix bad defaulting of ExecContext argument
> 
>
> Key: ARROW-16796
> URL: https://issues.apache.org/jira/browse/ARROW-16796
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Priority: Major
>
> In several places in Arrow code, invocations of Expression::Bind() default 
> the ExecContext argument. This leads to the default function registry being 
> used in expression manipulations, and this becomes a problem when the user 
> wishes to use a non-default function registry, e.g., when passing one to the 
> ExecContext of an ExecPlan, which is how I discovered this issue. The 
> problematic places I found for such Expression::Bind() invocation are:
>  * cpp/src/arrow/dataset/file_parquet.cc
>  * cpp/src/arrow/dataset/scanner.cc
>  * cpp/src/arrow/compute/exec/project_node.cc
>  * cpp/src/arrow/compute/exec/hash_join_node.cc
>  * cpp/src/arrow/compute/exec/filter_node.cc
> There are also other places in test and benchmark code (grep for 'Bind()').
> Another case of bad defaulting of an ExecContext argument is in 
> Inequality::simplifies_to in cpp/src/compute/exec/expression.cc where a fresh 
> ExecContext is created, instead of being received from the caller, and passed 
> to BindNonRecursive.
> I'd argue that an ExecContext variable should not be allowed to default, 
> except perhaps in the highest-level/user-facing APIs.





[jira] [Created] (ARROW-16796) [C++] Bad defaulting of ExecContext arg

2022-06-09 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-16796:
---

 Summary: [C++] Bad defaulting of ExecContext arg
 Key: ARROW-16796
 URL: https://issues.apache.org/jira/browse/ARROW-16796
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Yaron Gvili


In several places in Arrow code, invocations of Expression::Bind() default the 
ExecContext argument. This leads to the default function registry being used in 
expression manipulations, and this becomes a problem when the user wishes to 
use a non-default function registry, e.g., when passing one to the ExecContext 
of an ExecPlan, which is how I discovered this issue. The problematic places I 
found for such Expression::Bind() invocation are:
 * cpp/src/arrow/dataset/file_parquet.cc
 * cpp/src/arrow/dataset/scanner.cc
 * cpp/src/arrow/compute/exec/project_node.cc
 * cpp/src/arrow/compute/exec/hash_join_node.cc
 * cpp/src/arrow/compute/exec/filter_node.cc

There are also other places in test and benchmark code (grep for 'Bind()').

Another case of bad defaulting of an ExecContext argument is in 
Inequality::simplifies_to in cpp/src/compute/exec/expression.cc where a fresh 
ExecContext is created, instead of being received from the caller, and passed 
to BindNonRecursive.

I'd argue that an ExecContext variable should not be allowed to default, except 
perhaps in the highest-level/user-facing APIs.
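
The failure mode described above can be reproduced in a toy model. The sketch 
below is an assumption-laden illustration, not Arrow code: `ToyRegistry`, 
`ToyContext`, `GlobalRegistry`, `Bind`, `BuggyNodeBind` and `FixedNodeBind` are 
all hypothetical names.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical toy types; these are not Arrow's FunctionRegistry or
// ExecContext.
using ToyRegistry = std::set<std::string>;

struct ToyContext {
  const ToyRegistry* registry;  // not owned
};

const ToyRegistry& GlobalRegistry() {
  static const ToyRegistry global = {"add", "multiply"};
  return global;
}

// Bind with a defaulted context: callers that omit it get the global registry.
bool Bind(const std::string& function,
          const ToyContext& ctx = ToyContext{&GlobalRegistry()}) {
  return ctx.registry->count(function) > 0;
}

// A node that received a custom context but forgets to forward it.
bool BuggyNodeBind(const std::string& function, const ToyContext& /*ctx*/) {
  return Bind(function);  // silently uses the global registry
}

// The same node, forwarding the context it was given.
bool FixedNodeBind(const std::string& function, const ToyContext& ctx) {
  return Bind(function, ctx);
}

// Usage sketch: a function registered only in the custom registry is missed
// by the defaulted call and found once the context is forwarded.
void CheckForwarding() {
  const ToyRegistry custom = {"my_udf"};
  const ToyContext ctx{&custom};
  assert(!BuggyNodeBind("my_udf", ctx));
  assert(FixedNodeBind("my_udf", ctx));
}
```

Removing the default argument from `Bind` in this sketch would turn the buggy 
call into a compile error, which is the effect of not allowing the context to 
default.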





[jira] [Updated] (ARROW-16770) [C++] Arrow Substrait test fails with SIGSEGV, possibly due to gtest 1.11.0

2022-06-09 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-16770:

Summary: [C++] Arrow Substrait test fails with SIGSEGV, possibly due to 
gtest 1.11.0  (was: Arrow Substrait test fails with SIGSEGV, possibly due to 
gtest 1.11.0)

> [C++] Arrow Substrait test fails with SIGSEGV, possibly due to gtest 1.11.0
> ---
>
> Key: ARROW-16770
> URL: https://issues.apache.org/jira/browse/ARROW-16770
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Priority: Major
>
> I built Arrow using the instructions in the Python development page, under 
> the pyarrow-dev environment, and found that `arrow-substrait-substrait-test` 
> fails with SIGSEGV - see gdb session below. The same Arrow builds and runs 
> correctly on my system, outside of pyarrow-dev. I suspect this is due to 
> something different about gtest 1.11.0 as compared to gtest 1.10.0 based on 
> the following observations:
>  # The backtrace in the gdb session shows gtest 1.11.0 is used.
>  # The backtrace also shows the error is deep inside gtest, working on an 
> `UnorderedElementsAre` expectation.
>  # My system, outside pyarrow-dev, uses gtest 1.10.0.
>  
> {noformat}
> $ gdb --args ./release/arrow-substrait-substrait-test 
> GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
> Copyright (C) 2020 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> .
> Find the GDB manual and other documentation resources online at:
>     .
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from ./release/arrow-substrait-substrait-test...
> (gdb) run
> Starting program: 
> /mnt/user1/tscontract/github/rtpsw/arrow/cpp/build/debug/release/arrow-substrait-substrait-test
>  
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> [New Thread 0x741ff700 (LWP 115128)]
> Running main() from 
> /home/conda/feedstock_root/build_artifacts/gtest_1647154636757/work/googletest/src/gtest_main.cc
> [==] Running 33 tests from 3 test suites.
> [--] Global test environment set-up.
> [--] 4 tests from ExtensionIdRegistryTest
> [ RUN      ] ExtensionIdRegistryTest.RegisterTempTypes
> [       OK ] ExtensionIdRegistryTest.RegisterTempTypes (0 ms)
> [ RUN      ] ExtensionIdRegistryTest.RegisterTempFunctions
> [       OK ] ExtensionIdRegistryTest.RegisterTempFunctions (0 ms)
> [ RUN      ] ExtensionIdRegistryTest.RegisterNestedTypes
> [       OK ] ExtensionIdRegistryTest.RegisterNestedTypes (0 ms)
> [ RUN      ] ExtensionIdRegistryTest.RegisterNestedFunctions
> [       OK ] ExtensionIdRegistryTest.RegisterNestedFunctions (0 ms)
> [--] 4 tests from ExtensionIdRegistryTest (0 ms total)
> [--] 21 tests from Substrait
> [ RUN      ] Substrait.SupportedTypes
> [       OK ] Substrait.SupportedTypes (0 ms)
> [ RUN      ] Substrait.SupportedExtensionTypes
> [       OK ] Substrait.SupportedExtensionTypes (0 ms)
> [ RUN      ] Substrait.NamedStruct
> [       OK ] Substrait.NamedStruct (0 ms)
> [ RUN      ] Substrait.NoEquivalentArrowType
> [       OK ] Substrait.NoEquivalentArrowType (0 ms)
> [ RUN      ] Substrait.NoEquivalentSubstraitType
> [       OK ] Substrait.NoEquivalentSubstraitType (0 ms)
> [ RUN      ] Substrait.SupportedLiterals
> [       OK ] Substrait.SupportedLiterals (1 ms)
> [ RUN      ] Substrait.CannotDeserializeLiteral
> [       OK ] Substrait.CannotDeserializeLiteral (0 ms)
> [ RUN      ] Substrait.FieldRefRoundTrip
> [       OK ] Substrait.FieldRefRoundTrip (1 ms)
> [ RUN      ] Substrait.RecursiveFieldRef
> [       OK ] Substrait.RecursiveFieldRef (0 ms)
> [ RUN      ] Substrait.FieldRefsInExpressions
> [       OK ] Substrait.FieldRefsInExpressions (0 ms)
> [ RUN      ] Substrait.CallSpecialCaseRoundTrip
> [       OK ] Substrait.CallSpecialCaseRoundTrip (0 ms)
> [ RUN      ] Substrait.CallExtensionFunction
> [       OK ] Substrait.CallExtensionFunction (0 ms)
> [ RUN      ] Substrait.ReadRel
> Thread 1 "arrow-substrait" received signal SIGSEGV, Segmentation fault.
> 0x555b02e6 in 
> testing::internal::MatcherBase std::char_traits, std::allocator > const&>::MatchAndExplain 
> (listener=0x7fffb3a0, x=..., 
>     this=) at 
> 

[jira] [Updated] (ARROW-16770) Arrow Substrait test fails with SIGSEGV, possibly due to gtest 1.11.0

2022-06-07 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-16770:

Description: 
I built Arrow using the instructions in the Python development page, under the 
pyarrow-dev environment, and found that `arrow-substrait-substrait-test` fails 
with SIGSEGV - see gdb session below. The same Arrow builds and runs correctly 
on my system, outside of pyarrow-dev. I suspect this is due to something 
different about gtest 1.11.0 as compared to gtest 1.10.0 based on the following 
observations:
 # The backtrace in the gdb session shows gtest 1.11.0 is used.
 # The backtrace also shows the error is deep inside gtest, working on an 
`UnorderedElementsAre` expectation.
 # My system, outside pyarrow-dev, uses gtest 1.10.0.

 
{noformat}
$ gdb --args ./release/arrow-substrait-substrait-test 
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
    .
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./release/arrow-substrait-substrait-test...
(gdb) run
Starting program: 
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/build/debug/release/arrow-substrait-substrait-test
 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x741ff700 (LWP 115128)]
Running main() from 
/home/conda/feedstock_root/build_artifacts/gtest_1647154636757/work/googletest/src/gtest_main.cc
[==] Running 33 tests from 3 test suites.
[--] Global test environment set-up.
[--] 4 tests from ExtensionIdRegistryTest
[ RUN      ] ExtensionIdRegistryTest.RegisterTempTypes
[       OK ] ExtensionIdRegistryTest.RegisterTempTypes (0 ms)
[ RUN      ] ExtensionIdRegistryTest.RegisterTempFunctions
[       OK ] ExtensionIdRegistryTest.RegisterTempFunctions (0 ms)
[ RUN      ] ExtensionIdRegistryTest.RegisterNestedTypes
[       OK ] ExtensionIdRegistryTest.RegisterNestedTypes (0 ms)
[ RUN      ] ExtensionIdRegistryTest.RegisterNestedFunctions
[       OK ] ExtensionIdRegistryTest.RegisterNestedFunctions (0 ms)
[--] 4 tests from ExtensionIdRegistryTest (0 ms total)
[--] 21 tests from Substrait
[ RUN      ] Substrait.SupportedTypes
[       OK ] Substrait.SupportedTypes (0 ms)
[ RUN      ] Substrait.SupportedExtensionTypes
[       OK ] Substrait.SupportedExtensionTypes (0 ms)
[ RUN      ] Substrait.NamedStruct
[       OK ] Substrait.NamedStruct (0 ms)
[ RUN      ] Substrait.NoEquivalentArrowType
[       OK ] Substrait.NoEquivalentArrowType (0 ms)
[ RUN      ] Substrait.NoEquivalentSubstraitType
[       OK ] Substrait.NoEquivalentSubstraitType (0 ms)
[ RUN      ] Substrait.SupportedLiterals
[       OK ] Substrait.SupportedLiterals (1 ms)
[ RUN      ] Substrait.CannotDeserializeLiteral
[       OK ] Substrait.CannotDeserializeLiteral (0 ms)
[ RUN      ] Substrait.FieldRefRoundTrip
[       OK ] Substrait.FieldRefRoundTrip (1 ms)
[ RUN      ] Substrait.RecursiveFieldRef
[       OK ] Substrait.RecursiveFieldRef (0 ms)
[ RUN      ] Substrait.FieldRefsInExpressions
[       OK ] Substrait.FieldRefsInExpressions (0 ms)
[ RUN      ] Substrait.CallSpecialCaseRoundTrip
[       OK ] Substrait.CallSpecialCaseRoundTrip (0 ms)
[ RUN      ] Substrait.CallExtensionFunction
[       OK ] Substrait.CallExtensionFunction (0 ms)
[ RUN      ] Substrait.ReadRel
Thread 1 "arrow-substrait" received signal SIGSEGV, Segmentation fault.
0x555b02e6 in 
testing::internal::MatcherBase, std::allocator > const&>::MatchAndExplain 
(listener=0x7fffb3a0, x=..., 
    this=) at 
/mnt/soft1/tscontract/pkg/miniconda3/envs/pyarrow-dev/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/shared_ptr_base.h:1324
1324          get() const noexcept
(gdb) bt
#0  0x555b02e6 in 
testing::internal::MatcherBase, std::allocator > const&>::MatchAndExplain 
(listener=0x7fffb3a0, x=..., 
    this=) at 
/mnt/soft1/tscontract/pkg/miniconda3/envs/pyarrow-dev/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/shared_ptr_base.h:1324
#1  
testing::internal::UnorderedElementsAreMatcherImpl, std::allocator >, 
std::allocator, 
std::allocator > > > 
const&>::AnalyzeElements<__gnu_cxx::__normal_iterator, std::allocator > const*, 
std::vector, 
std::allocator >, std::allocator, std::allocator > > > > > 
(listener=0x7fffb640, 
    

[jira] [Created] (ARROW-16770) Arrow Substrait test fails with SIGSEGV, possibly due to gtest 1.11.0

2022-06-07 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-16770:
---

 Summary: Arrow Substrait test fails with SIGSEGV, possibly due to 
gtest 1.11.0
 Key: ARROW-16770
 URL: https://issues.apache.org/jira/browse/ARROW-16770
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Yaron Gvili


I built Arrow using the instructions in the Python development page, under the 
pyarrow-dev environment, and found that `arrow-substrait-substrait-test` fails 
with SIGSEGV - see gdb session below. The same Arrow source builds and runs correctly 
on my system, outside of pyarrow-dev. I suspect this is due to something 
different about gtest 1.11.0 as compared to gtest 1.10.0 based on the following 
observations:
 # The backtrace in the gdb session shows gtest 1.11.0 is used.
 # The backtrace also shows the error is deep inside gtest, working on an 
`UnorderedElementsAre` expectation.
 # My system, outside pyarrow-dev, uses gtest 1.10.0.

 
{noformat}
$ gdb --args ./release/arrow-substrait-substrait-test 
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
    .
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./release/arrow-substrait-substrait-test...
(gdb) run
Starting program: 
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/build/debug/release/arrow-substrait-substrait-test
 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x741ff700 (LWP 115128)]
Running main() from 
/home/conda/feedstock_root/build_artifacts/gtest_1647154636757/work/googletest/src/gtest_main.cc
[==] Running 33 tests from 3 test suites.
[--] Global test environment set-up.
[--] 4 tests from ExtensionIdRegistryTest
[ RUN      ] ExtensionIdRegistryTest.RegisterTempTypes
[       OK ] ExtensionIdRegistryTest.RegisterTempTypes (0 ms)
[ RUN      ] ExtensionIdRegistryTest.RegisterTempFunctions
[       OK ] ExtensionIdRegistryTest.RegisterTempFunctions (0 ms)
[ RUN      ] ExtensionIdRegistryTest.RegisterNestedTypes
[       OK ] ExtensionIdRegistryTest.RegisterNestedTypes (0 ms)
[ RUN      ] ExtensionIdRegistryTest.RegisterNestedFunctions
[       OK ] ExtensionIdRegistryTest.RegisterNestedFunctions (0 ms)
[--] 4 tests from ExtensionIdRegistryTest (0 ms total)
[--] 21 tests from Substrait
[ RUN      ] Substrait.SupportedTypes
[       OK ] Substrait.SupportedTypes (0 ms)
[ RUN      ] Substrait.SupportedExtensionTypes
[       OK ] Substrait.SupportedExtensionTypes (0 ms)
[ RUN      ] Substrait.NamedStruct
[       OK ] Substrait.NamedStruct (0 ms)
[ RUN      ] Substrait.NoEquivalentArrowType
[       OK ] Substrait.NoEquivalentArrowType (0 ms)
[ RUN      ] Substrait.NoEquivalentSubstraitType
[       OK ] Substrait.NoEquivalentSubstraitType (0 ms)
[ RUN      ] Substrait.SupportedLiterals
[       OK ] Substrait.SupportedLiterals (1 ms)
[ RUN      ] Substrait.CannotDeserializeLiteral
[       OK ] Substrait.CannotDeserializeLiteral (0 ms)
[ RUN      ] Substrait.FieldRefRoundTrip
[       OK ] Substrait.FieldRefRoundTrip (1 ms)
[ RUN      ] Substrait.RecursiveFieldRef
[       OK ] Substrait.RecursiveFieldRef (0 ms)
[ RUN      ] Substrait.FieldRefsInExpressions
[       OK ] Substrait.FieldRefsInExpressions (0 ms)
[ RUN      ] Substrait.CallSpecialCaseRoundTrip
[       OK ] Substrait.CallSpecialCaseRoundTrip (0 ms)
[ RUN      ] Substrait.CallExtensionFunction
[       OK ] Substrait.CallExtensionFunction (0 ms)
[ RUN      ] Substrait.ReadRel
Thread 1 "arrow-substrait" received signal SIGSEGV, Segmentation fault.
0x555b02e6 in 
testing::internal::MatcherBase, std::allocator > const&>::MatchAndExplain 
(listener=0x7fffb3a0, x=..., 
    this=) at 
/mnt/soft1/tscontract/pkg/miniconda3/envs/pyarrow-dev/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/shared_ptr_base.h:1324
1324          get() const noexcept
(gdb) bt
#0  0x555b02e6 in 
testing::internal::MatcherBase, std::allocator > const&>::MatchAndExplain 
(listener=0x7fffb3a0, x=..., 
    this=) at 
/mnt/soft1/tscontract/pkg/miniconda3/envs/pyarrow-dev/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/shared_ptr_base.h:1324
#1  
testing::internal::UnorderedElementsAreMatcherImpl, std::allocator >, 
std::allocator, 
std::allocator > > > 

[jira] [Assigned] (ARROW-16681) [Python] Fix doc for PyArrow unit tests dependant on module path

2022-06-06 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili reassigned ARROW-16681:
---

Assignee: Yaron Gvili

> [Python] Fix doc for PyArrow unit tests dependant on module path
> 
>
> Key: ARROW-16681
> URL: https://issues.apache.org/jira/browse/ARROW-16681
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The [PyArrow dev docs|https://arrow.apache.org/docs/developers/python.html] 
> currently give
> {code:java}
> python -m pytest arrow/python/pyarrow{code}
> as the unit-testing command. However, some unit tests (see the list at 
> the bottom) currently fail because they depend on the "arrow/python" 
> directory being included in the module path, whereas the module path 
> instead includes the current directory - see the example below.
>  
> The fix should be either in the unit tests, to avoid the dependency on the 
> current directory being in the module path, or in the documentation, to 
> instruct that the unit tests should be run from the "arrow/python" directory.
> As an example of a failed test, for
> {code:java}
> python -m pytest 
> arrow/python/pyarrow/tests/test_misc.py::test_runtime_info{code}
> I'm getting an error with this tail:
>  
> {noformat}
> Traceback (most recent call last):
>   File "", line 2, in 
> ModuleNotFoundError: No module named 'pyarrow'
> =
>  short test summary info 
> ==
> FAILED arrow/python/pyarrow/tests/test_misc.py::test_runtime_info - 
> subprocess.CalledProcessError: Command 
> '['/mnt/soft1/tscontract/pkg/miniconda3/envs/pyarrow-dev/bin/python', '-c', 
> "if 1:\n          ...{noformat}
> Here is the list of unit tests for which I'm getting a similar error:
>  
>  
> {noformat}
> test_memory.py::test_env_var
> test_memory.py::test_debug_memory_pool_abort[default_memory_pool]
> test_memory.py::test_debug_memory_pool_abort[jemalloc_memory_pool]
> test_memory.py::test_debug_memory_pool_abort[system_memory_pool]
> test_memory.py::test_debug_memory_pool_trap[default_memory_pool]
> test_memory.py::test_debug_memory_pool_trap[jemalloc_memory_pool]
> test_memory.py::test_debug_memory_pool_trap[system_memory_pool]
> test_memory.py::test_debug_memory_pool_warn[default_memory_pool]
> test_memory.py::test_debug_memory_pool_warn[jemalloc_memory_pool]
> test_memory.py::test_debug_memory_pool_warn[system_memory_pool]
> test_memory.py::test_debug_memory_pool_disabled[default_memory_pool]
> test_memory.py::test_debug_memory_pool_disabled[jemalloc_memory_pool]
> test_memory.py::test_debug_memory_pool_disabled[system_memory_pool]
> test_misc.py::test_env_var_io_thread_count
> test_misc.py::test_runtime_info{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16681) [Python] Fix doc for PyArrow unit tests dependant on module path

2022-06-06 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-16681:

Summary: [Python] Fix doc for PyArrow unit tests dependant on module path  
(was: [Python] Fix PyArrow unit tests dependant on module path)

> [Python] Fix doc for PyArrow unit tests dependant on module path
> 
>
> Key: ARROW-16681
> URL: https://issues.apache.org/jira/browse/ARROW-16681
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Yaron Gvili
>Priority: Major
>
> The [PyArrow dev docs|https://arrow.apache.org/docs/developers/python.html] 
> currently give
> {code:java}
> python -m pytest arrow/python/pyarrow{code}
> as the unit-testing command. However, some unit tests (see the list at 
> the bottom) currently fail because they depend on the "arrow/python" 
> directory being included in the module path, whereas the module path 
> instead includes the current directory - see the example below.
>  
> The fix should be either in the unit tests, to avoid the dependency on the 
> current directory being in the module path, or in the documentation, to 
> instruct that the unit tests should be run from the "arrow/python" directory.
> As an example of a failed test, for
> {code:java}
> python -m pytest 
> arrow/python/pyarrow/tests/test_misc.py::test_runtime_info{code}
> I'm getting an error with this tail:
>  
> {noformat}
> Traceback (most recent call last):
>   File "", line 2, in 
> ModuleNotFoundError: No module named 'pyarrow'
> =
>  short test summary info 
> ==
> FAILED arrow/python/pyarrow/tests/test_misc.py::test_runtime_info - 
> subprocess.CalledProcessError: Command 
> '['/mnt/soft1/tscontract/pkg/miniconda3/envs/pyarrow-dev/bin/python', '-c', 
> "if 1:\n          ...{noformat}
> Here is the list of unit tests for which I'm getting a similar error:
>  
>  
> {noformat}
> test_memory.py::test_env_var
> test_memory.py::test_debug_memory_pool_abort[default_memory_pool]
> test_memory.py::test_debug_memory_pool_abort[jemalloc_memory_pool]
> test_memory.py::test_debug_memory_pool_abort[system_memory_pool]
> test_memory.py::test_debug_memory_pool_trap[default_memory_pool]
> test_memory.py::test_debug_memory_pool_trap[jemalloc_memory_pool]
> test_memory.py::test_debug_memory_pool_trap[system_memory_pool]
> test_memory.py::test_debug_memory_pool_warn[default_memory_pool]
> test_memory.py::test_debug_memory_pool_warn[jemalloc_memory_pool]
> test_memory.py::test_debug_memory_pool_warn[system_memory_pool]
> test_memory.py::test_debug_memory_pool_disabled[default_memory_pool]
> test_memory.py::test_debug_memory_pool_disabled[jemalloc_memory_pool]
> test_memory.py::test_debug_memory_pool_disabled[system_memory_pool]
> test_misc.py::test_env_var_io_thread_count
> test_misc.py::test_runtime_info{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions

2022-06-02 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545600#comment-17545600
 ] 

Yaron Gvili commented on ARROW-16211:
-

See [this PR|https://github.com/apache/arrow/pull/13232] as a solution for 
[this 
comment|https://issues.apache.org/jira/browse/ARROW-16211?focusedCommentId=17539044=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17539044].

> [C++][Python] Unregister compute functions
> --
>
> Key: ARROW-16211
> URL: https://issues.apache.org/jira/browse/ARROW-16211
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In general, when using UDFs, the user defines a function expecting a 
> particular outcome. When building the program, there needs to be a way to 
> update existing function kernels if it expands beyond what was originally planned. 
> In such situations, there should be a way to remove the existing definition 
> and add a new definition. To enable this, the unregister functionality has to 
> be included. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16681) [Python] Fix PyArrow unit tests dependant on module path

2022-06-01 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544791#comment-17544791
 ] 

Yaron Gvili commented on ARROW-16681:
-

cc [~lidavidm] - making sure this is noticed.

> [Python] Fix PyArrow unit tests dependant on module path
> 
>
> Key: ARROW-16681
> URL: https://issues.apache.org/jira/browse/ARROW-16681
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Yaron Gvili
>Priority: Major
>
> The [PyArrow dev docs|https://arrow.apache.org/docs/developers/python.html] 
> currently give
> {code:java}
> python -m pytest arrow/python/pyarrow{code}
> as the unit-testing command. However, some unit tests (see the list at 
> the bottom) currently fail because they depend on the "arrow/python" 
> directory being included in the module path, whereas the module path 
> instead includes the current directory - see the example below.
>  
> The fix should be either in the unit tests, to avoid the dependency on the 
> current directory being in the module path, or in the documentation, to 
> instruct that the unit tests should be run from the "arrow/python" directory.
> As an example of a failed test, for
> {code:java}
> python -m pytest 
> arrow/python/pyarrow/tests/test_misc.py::test_runtime_info{code}
> I'm getting an error with this tail:
>  
> {noformat}
> Traceback (most recent call last):
>   File "", line 2, in 
> ModuleNotFoundError: No module named 'pyarrow'
> =
>  short test summary info 
> ==
> FAILED arrow/python/pyarrow/tests/test_misc.py::test_runtime_info - 
> subprocess.CalledProcessError: Command 
> '['/mnt/soft1/tscontract/pkg/miniconda3/envs/pyarrow-dev/bin/python', '-c', 
> "if 1:\n          ...{noformat}
> Here is the list of unit tests for which I'm getting a similar error:
>  
>  
> {noformat}
> test_memory.py::test_env_var
> test_memory.py::test_debug_memory_pool_abort[default_memory_pool]
> test_memory.py::test_debug_memory_pool_abort[jemalloc_memory_pool]
> test_memory.py::test_debug_memory_pool_abort[system_memory_pool]
> test_memory.py::test_debug_memory_pool_trap[default_memory_pool]
> test_memory.py::test_debug_memory_pool_trap[jemalloc_memory_pool]
> test_memory.py::test_debug_memory_pool_trap[system_memory_pool]
> test_memory.py::test_debug_memory_pool_warn[default_memory_pool]
> test_memory.py::test_debug_memory_pool_warn[jemalloc_memory_pool]
> test_memory.py::test_debug_memory_pool_warn[system_memory_pool]
> test_memory.py::test_debug_memory_pool_disabled[default_memory_pool]
> test_memory.py::test_debug_memory_pool_disabled[jemalloc_memory_pool]
> test_memory.py::test_debug_memory_pool_disabled[system_memory_pool]
> test_misc.py::test_env_var_io_thread_count
> test_misc.py::test_runtime_info{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16681) [Python] Fix PyArrow unit tests dependant on module path

2022-05-28 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-16681:
---

 Summary: [Python] Fix PyArrow unit tests dependant on module path
 Key: ARROW-16681
 URL: https://issues.apache.org/jira/browse/ARROW-16681
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Yaron Gvili


The [PyArrow dev docs|https://arrow.apache.org/docs/developers/python.html] 
currently give
{code:java}
python -m pytest arrow/python/pyarrow{code}
as the unit-testing command. However, some unit tests (see the list at the 
bottom) currently fail because they depend on the "arrow/python" directory 
being included in the module path, whereas the module path instead includes 
the current directory - see the example below.

 

The fix should be either in the unit tests, to avoid the dependency on the 
current directory being in the module path, or in the documentation, to 
instruct that the unit tests should be run from the "arrow/python" directory.

As an example of a failed test, for
{code:java}
python -m pytest 
arrow/python/pyarrow/tests/test_misc.py::test_runtime_info{code}
I'm getting an error with this tail:

 
{noformat}
Traceback (most recent call last):
  File "", line 2, in 
ModuleNotFoundError: No module named 'pyarrow'
=
 short test summary info 
==
FAILED arrow/python/pyarrow/tests/test_misc.py::test_runtime_info - 
subprocess.CalledProcessError: Command 
'['/mnt/soft1/tscontract/pkg/miniconda3/envs/pyarrow-dev/bin/python', '-c', "if 
1:\n          ...{noformat}
Here is the list of unit tests for which I'm getting a similar error:

 

 
{noformat}
test_memory.py::test_env_var
test_memory.py::test_debug_memory_pool_abort[default_memory_pool]
test_memory.py::test_debug_memory_pool_abort[jemalloc_memory_pool]
test_memory.py::test_debug_memory_pool_abort[system_memory_pool]
test_memory.py::test_debug_memory_pool_trap[default_memory_pool]
test_memory.py::test_debug_memory_pool_trap[jemalloc_memory_pool]
test_memory.py::test_debug_memory_pool_trap[system_memory_pool]
test_memory.py::test_debug_memory_pool_warn[default_memory_pool]
test_memory.py::test_debug_memory_pool_warn[jemalloc_memory_pool]
test_memory.py::test_debug_memory_pool_warn[system_memory_pool]
test_memory.py::test_debug_memory_pool_disabled[default_memory_pool]
test_memory.py::test_debug_memory_pool_disabled[jemalloc_memory_pool]
test_memory.py::test_debug_memory_pool_disabled[system_memory_pool]
test_misc.py::test_env_var_io_thread_count
test_misc.py::test_runtime_info{noformat}
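
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not taken from the PyArrow test suite: it assumes the failing tests start a fresh `python -c` child process, and it uses a hypothetical throwaway package named `demo_inplace_pkg` as a stand-in for an in-place `pyarrow` build under "arrow/python". A `python -c` child normally puts its current directory first on the module path, so the import only succeeds when the child starts in the directory that contains the package.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def child_can_import(module_name, cwd):
    """Return True if a fresh `python -c` child started in `cwd` can import the module."""
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        cwd=cwd,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def demo():
    """Try the import from the package's parent directory and from one level above it."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical layout: <root>/python/demo_inplace_pkg/__init__.py,
        # standing in for arrow/python/pyarrow in an in-place build.
        pkg_parent = Path(root) / "python"
        (pkg_parent / "demo_inplace_pkg").mkdir(parents=True)
        (pkg_parent / "demo_inplace_pkg" / "__init__.py").write_text("")
        from_pkg_parent = child_can_import("demo_inplace_pkg", str(pkg_parent))
        from_elsewhere = child_can_import("demo_inplace_pkg", root)
        return from_pkg_parent, from_elsewhere


if __name__ == "__main__":
    print(demo())
```

If this is indeed the mechanism, running the tests from the "arrow/python" directory, or adding that directory to `PYTHONPATH` for the child processes, would be expected to make the import succeed.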
 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16657) [C++] Support nesting of extension-id-registries

2022-05-27 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-16657:

Summary: [C++] Support nesting of extension-id-registries  (was: Support 
nesting of extension-id-registries)

> [C++] Support nesting of extension-id-registries
> 
>
> Key: ARROW-16657
> URL: https://issues.apache.org/jira/browse/ARROW-16657
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, only a default extension-id-registry is supported (under 
> arrow/engine/substrait). Modifying this registry has global effects, which is 
> often undesirable. Support for nesting extension-id-registries will provide 
> scoping for such modifications.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16677) [C++] Support nesting of function registries

2022-05-27 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-16677:
---

 Summary: [C++] Support nesting of function registries
 Key: ARROW-16677
 URL: https://issues.apache.org/jira/browse/ARROW-16677
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili


Currently, only a default function-registry is supported. Modifying this 
registry has global effects, which is often undesirable. Support for nesting 
function-registries will provide scoping for such modifications.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16657) Support nesting of extension-id-registries

2022-05-25 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili reassigned ARROW-16657:
---

Assignee: Yaron Gvili

> Support nesting of extension-id-registries
> --
>
> Key: ARROW-16657
> URL: https://issues.apache.org/jira/browse/ARROW-16657
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> Currently, only a default extension-id-registry is supported (under 
> arrow/engine/substrait). Modifying this registry has global effects, which is 
> often undesirable. Support for nesting extension-id-registries will provide 
> scoping for such modifications.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16657) Support nesting of extension-id-registries

2022-05-25 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-16657:
---

 Summary: Support nesting of extension-id-registries
 Key: ARROW-16657
 URL: https://issues.apache.org/jira/browse/ARROW-16657
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Yaron Gvili


Currently, only a default extension-id-registry is supported (under 
arrow/engine/substrait). Modifying this registry has global effects, which is 
often undesirable. Support for nesting extension-id-registries will provide 
scoping for such modifications.
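
As a rough illustration of the scoping idea, here is a minimal sketch in Python. It is not the Arrow API: the `NestedRegistry` class, its methods, and the registered names are all hypothetical. It only shows the general pattern of a registry that stores its own entries and falls back to a parent for lookups.

```python
from typing import Callable, Dict, Optional


class NestedRegistry:
    """Hypothetical registry whose lookups fall back to an optional parent."""

    def __init__(self, parent: Optional["NestedRegistry"] = None) -> None:
        self._parent = parent
        self._entries: Dict[str, Callable] = {}

    def register(self, name: str, func: Callable) -> None:
        # Reject names already visible here or in any ancestor registry.
        if self.lookup(name) is not None:
            raise KeyError(f"{name!r} is already registered")
        self._entries[name] = func

    def lookup(self, name: str) -> Optional[Callable]:
        # Walk from this registry up through its ancestors.
        registry = self
        while registry is not None:
            if name in registry._entries:
                return registry._entries[name]
            registry = registry._parent
        return None


# The default registry plays the role of the global one.
default_registry = NestedRegistry()
default_registry.register("add", lambda a, b: a + b)

# A nested registry sees the default entries but keeps its own additions local.
scoped = NestedRegistry(parent=default_registry)
scoped.register("my_udf", lambda x: x * 2)
```

With this shape, dropping the `scoped` object discards its registrations without touching the default registry, which is the kind of scoping the issue asks for.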



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions

2022-05-22 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540571#comment-17540571
 ] 

Yaron Gvili commented on ARROW-16211:
-

I created a [PR for a nested 
extension-id-registry|https://github.com/apache/arrow/pull/13214] using the 
approach I proposed. Please let me know what you think, as I have my own use 
cases for this nested extension-id-registry.

> [C++][Python] Unregister compute functions
> --
>
> Key: ARROW-16211
> URL: https://issues.apache.org/jira/browse/ARROW-16211
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In general, when using UDFs, the user defines a function expecting a 
> particular outcome. When building the program, there needs to be a way to 
> update existing function kernels if it expands beyond what was originally planned. 
> In such situations, there should be a way to remove the existing definition 
> and add a new definition. To enable this, the unregister functionality has to 
> be included. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions

2022-05-18 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539052#comment-17539052
 ] 

Yaron Gvili commented on ARROW-16211:
-

This second-layer-registry approach is good for another use case in which the 
user runs multiple execution engine invocations, either in sequence or in 
parallel, from the same Python interpreter and wants to keep separate the UDFs 
registered in each invocation.

> [C++][Python] Unregister compute functions
> --
>
> Key: ARROW-16211
> URL: https://issues.apache.org/jira/browse/ARROW-16211
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In general, when using UDFs, the user defines a function expecting a 
> particular outcome. When building the program, there needs to be a way to 
> update existing function kernels if it expands beyond what was originally planned. 
> In such situations, there should be a way to remove the existing definition 
> and add a new definition. To enable this, the unregister functionality has to 
> be included. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions

2022-05-18 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539044#comment-17539044
 ] 

Yaron Gvili commented on ARROW-16211:
-

Another alternative to consider is registering Python UDFs to an extension 
registry instance that (1) is specific to the Python interpreter and (2) is 
linked to the default global one (so it can find both UDF and normal 
functions). This Python-specific registry would then be passed to be used by 
the execution engine. I think this way (only) the Python-specific registry 
would naturally get cleaned up on finalization of the Python interpreter.

> [C++][Python] Unregister compute functions
> --
>
> Key: ARROW-16211
> URL: https://issues.apache.org/jira/browse/ARROW-16211
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In general, when using UDFs, the user defines a function expecting a 
> particular outcome. When building the program, there needs to be a way to 
> update existing function kernels if it expands beyond what was originally planned. 
> In such situations, there should be a way to remove the existing definition 
> and add a new definition. To enable this, the unregister functionality has to 
> be included. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16582) [Python] Include DATASET in list of components in PyArrow's dev page

2022-05-18 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538603#comment-17538603
 ] 

Yaron Gvili commented on ARROW-16582:
-

Another possible fix is for the build to automatically select DATASET if some 
other component, like PARQUET, is selected.

> [Python] Include DATASET in list of components in PyArrow's dev page
> 
>
> Key: ARROW-16582
> URL: https://issues.apache.org/jira/browse/ARROW-16582
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Yaron Gvili
>Priority: Major
> Fix For: 9.0.0
>
>
> PyArrow's dev page has a [build-and-test 
> section|https://arrow.apache.org/docs/developers/python.html#build-and-test] 
> that currently does not list DATASET as a component. Using a recent Arrow 
> version (commit e5e490), I observed that DATASET was mandatory for the 
> successful completion of the test suite run by `python -m pytest pyarrow/`, 
> as recommended on the page. Without `export 
> PYARROW_WITH_DATASET=1`, I observed errors with `test_dataset.py`, 
> `test_exec_plan.py`, and a couple of others.
> Since DATASET is intended to be an optional component, it should be listed in 
> this section. In addition, the documented test suite command should be 
> updated to one that doesn't fail without the DATASET component being selected 
> (or else the test suite itself should be fixed).
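
One way the test suite itself could tolerate a missing optional component is to skip the affected tests when the component cannot be imported. The following is a minimal sketch using only the standard library. The helper `component_available` and the test class are hypothetical, and it assumes that importing `pyarrow.dataset` raises `ImportError` when PyArrow was built without DATASET.

```python
import importlib
import unittest


def component_available(module_name):
    """Return True if the named module can be imported in this environment."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Also covers ModuleNotFoundError, e.g. when a parent package is missing.
        return False
    return True


@unittest.skipUnless(component_available("pyarrow.dataset"),
                     "PyArrow was built without the DATASET component")
class DatasetSmokeTest(unittest.TestCase):
    def test_import(self):
        # Runs only when the optional component is importable.
        import pyarrow.dataset  # noqa: F401


if __name__ == "__main__":
    unittest.main()
```

With a guard of this kind, a build without `PYARROW_WITH_DATASET=1` would report the dataset tests as skipped instead of as errors.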



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16582) [Python] Include DATASET in list of components in PyArrow's dev page

2022-05-16 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537585#comment-17537585
 ] 

Yaron Gvili commented on ARROW-16582:
-

[~raulcd], I used commit e5e490 of Arrow 
(https://github.com/apache/arrow/tree/e5e4901eb84224f353193bb4f512d60e82e40aa9).

> [Python] Include DATASET in list of components in PyArrow's dev page
> 
>
> Key: ARROW-16582
> URL: https://issues.apache.org/jira/browse/ARROW-16582
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Yaron Gvili
>Priority: Major
>
> PyArrow's dev page has a [build-and-test 
> section|https://arrow.apache.org/docs/developers/python.html#build-and-test] 
> that currently does not list DATASET as a component. Using a recent Arrow 
> version (commit e5e490), I observed that DATASET was mandatory for the 
> successful completion of the test suite run by `python -m pytest pyarrow/`, 
> as recommended on the page. Without `export 
> PYARROW_WITH_DATASET=1`, I observed errors with `test_dataset.py`, 
> `test_exec_plan.py`, and a couple of others.
> Since DATASET is intended to be an optional component, it should be listed in 
> this section. In addition, the documented test suite command should be 
> updated to one that doesn't fail without the DATASET component being selected 
> (or else the test suite itself should be fixed).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

