[ 
https://issues.apache.org/jira/browse/CALCITE-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132573#comment-17132573
 ] 

Haisheng Yuan commented on CALCITE-3786:
----------------------------------------

The digest of RelNode in the patch does reuse RexNode parts, but the RexNode 
doesn't, which can still result in OOM like CALCITE-3784. The reuse of RexNode 
is the true pain point. 

If the RexNode also changes its Digest as in the patch, it can solve the OOM, 
because the digest (unlike digest string) size won't grow exponentially. 
However, the digest itself is just duplicate each field of the RelNode and 
RexNode again, which still double memory usage of RelNode and RexNode. 

I guess everyone knows that the digest is mainly used to check whether there is 
equivalent node in the planner quickly, without deep comparison. If we agree to 
discard the shallow comparison, I am totally fine with that. But what is the 
proposal's advantage over just removing the digest from RelNode and RexNode 
completely and caching the hashcode only?

In addition, for queries with complex and duplicate expressions, e.g. (a > 10 
and b < 1) or (a > 10 and b > 5) ... 
After query parser, the two RexCalls {{a > 10}} are different instances, in 
real case that can be arbitrary complex expressions, and repeat many times.  
And after rule transformations, like column pruning, all the RexNodes will 
become distinct instances and many of them duplicate again. This causes a lot 
of memory waste for large queries. With the so-called Digest in the proposal, 
it can consume less memory than the original string digest, still doubles the 
actual memory of RelNode and RexNode, but at the same time loses the ability of 
shallow comparison.

> Add Digest interface to enable efficient hashCode(equals) for RexNode and 
> RelNode
> ---------------------------------------------------------------------------------
>
>                 Key: CALCITE-3786
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3786
>             Project: Calcite
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 1.21.0
>            Reporter: Vladimir Sitnikov
>            Assignee: Danny Chen
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Current digests for RexNode, RelNode, RelType, and similar cases use String 
> concatenation.
> It is easy to implement, however, it has drawbacks:
> 1) String objects cannot be reused. For instance, RexCall has operands, 
> however, the digest is duplicated. It causes extra memory use and extra CPU 
> for string copying
> 2) There's no way to have multiple #toString() methods. RelType might need 
> multiple digests: "including field names", "excluding field names".
> A suggested resolution might be behind the lines of
> {code:java}
> class Digest { // immutable
>   final int hashCode; // speedup hashCode and equals
>   final Object[] contents; // The values are either other Digest objects or 
> Strings
>   String toString(); // e.g. for debugging purposes
>   int compareTo(Digest); // e.g. for debugging purposes.
> }
> {code}
> Note how fields in Kotlin are aligned much better, and it makes it easier to 
> read:
> {code:java}
> class Digest { // immutable
>   val hashCode: Int // speedup hashCode and equals
>   val contents: Array<Any> // The values are either other Digest objects or 
> Strings
>   fun toString(): String // e.g. for debugging purposes
>   fun compareTo(other: Digest): Int // e.g. for debugging purposes.
> }
> {code}
> Then the digest for RexCall could be the bits relevant to RexCall itself + 
> digests of the operands (which can be reused as is)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to