[jira] [Commented] (ARROW-16796) [C++] Fix bad defaulting of ExecContext argument

2022-06-10 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552569#comment-17552569
 ] 

Yaron Gvili commented on ARROW-16796:
-

If coding safety is a major concern here (IMHO it is), I'd suggest that in the 
longer-term Arrow code should distinguish between simplification of expressions 
with and without functions/execution, where only the former requires an 
ExecContext whereas only the latter will fail if a function exists in the 
expression. Perhaps the simplest, though likely not ideal, code-change for this 
is by defaulting ExecContext to an implementation that fails.

The purpose of the PR is just to fix in the short-term. Follow-up issues can be 
created for what remains.

> [C++] Fix bad defaulting of ExecContext argument
> 
>
> Key: ARROW-16796
> URL: https://issues.apache.org/jira/browse/ARROW-16796
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> In several places in Arrow code, invocations of Expression::Bind() default 
> the ExecContext argument. This leads to the default function registry being 
> used in expression manipulations, and this becomes a problem when the user 
> wishes to use a non-default function registry, e.g., when passing one to the 
> ExecContext of an ExecPlan, which is how I discovered this issue. The 
> problematic places I found for such Expression::Bind() invocation are:
>  * cpp/src/arrow/dataset/file_parquet.cc
>  * cpp/src/arrow/dataset/scanner.cc
>  * cpp/src/arrow/compute/exec/project_node.cc
>  * cpp/src/arrow/compute/exec/hash_join_node.cc
>  * cpp/src/arrow/compute/exec/filter_node.cc
> There are also other places in test and benchmark code (grep for 'Bind()').
> Another case of bad defaulting of an ExecContext argument is in 
> Inequality::simplifies_to in cpp/src/compute/exec/expression.cc where a fresh 
> ExecContext is created, instead of being received from the caller, and passed 
> to BindNonRecursive.
> I'd argue that an ExecContext variable should not be allowed to default, 
> except perhaps in the highest-level/user-facing APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16796) [C++] Fix bad defaulting of ExecContext argument

2022-06-10 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552568#comment-17552568
 ] 

Yaron Gvili commented on ARROW-16796:
-

Copying [Weston Pace's 
post|https://github.com/apache/arrow/pull/13355#issuecomment-1151679039]:

Good catch. I wonder if we should remove the default argument to bind entirely 
(it would look something like 
[westonpace@{{{}c9ae1dd{}}}|https://github.com/westonpace/arrow/commit/c9ae1dd6a0857af69e48a95ec76480f4c466791e]
 ). Looks like there are only a few other non-test spots we call bind.
 * In the parquet reader we convert statistics into expressions and bind them 
to the schema. These expressions will only use min/max and it's only really for 
simplification and not execution so we're probably ok.
 * The scanner has a number of methods that create exec plans (this is the 
"lightweight producer" half of the scanner). We could arguably add an 
ExecContext to scan options but I think it would better to start phasing out 
this half of the scanner in favor of direct use of exec plans instead.

> [C++] Fix bad defaulting of ExecContext argument
> 
>
> Key: ARROW-16796
> URL: https://issues.apache.org/jira/browse/ARROW-16796
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In several places in Arrow code, invocations of Expression::Bind() default 
> the ExecContext argument. This leads to the default function registry being 
> used in expression manipulations, and this becomes a problem when the user 
> wishes to use a non-default function registry, e.g., when passing one to the 
> ExecContext of an ExecPlan, which is how I discovered this issue. The 
> problematic places I found for such Expression::Bind() invocation are:
>  * cpp/src/arrow/dataset/file_parquet.cc
>  * cpp/src/arrow/dataset/scanner.cc
>  * cpp/src/arrow/compute/exec/project_node.cc
>  * cpp/src/arrow/compute/exec/hash_join_node.cc
>  * cpp/src/arrow/compute/exec/filter_node.cc
> There are also other places in test and benchmark code (grep for 'Bind()').
> Another case of bad defaulting of an ExecContext argument is in 
> Inequality::simplifies_to in cpp/src/compute/exec/expression.cc where a fresh 
> ExecContext is created, instead of being received from the caller, and passed 
> to BindNonRecursive.
> I'd argue that an ExecContext variable should not be allowed to default, 
> except perhaps in the highest-level/user-facing APIs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)