[ 
https://issues.apache.org/jira/browse/PARQUET-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653313#comment-17653313
 ] 

Gabor Szadovszky commented on PARQUET-2220:
-------------------------------------------

[~abhiSumo304], I agree that eagerly storing the toString value is not a good 
idea, and I don't think it has a proper use case either. toString should be 
used for debugging purposes anyway, so eagerly storing the value does not 
really make sense. Unfortunately, I no longer work actively on the Parquet 
code base. Feel free to put up a PR to fix this and I'll try to review it in 
time.
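
For illustration, a fix could build the string on demand instead of keeping it 
in a field. The following is only a minimal sketch under stated assumptions: 
LazyColumnPredicate and LazyToStringDemo are hypothetical, simplified 
stand-ins (a plain String replaces Column<T>), not the actual Parquet classes 
and not necessarily the shape a real PR would take.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical stand-in for a column predicate; not the actual Parquet class.
final class LazyColumnPredicate<T extends Comparable<T>> {
  private final String columnPath; // assumption: a plain String instead of Column<T>
  private final T value;

  LazyColumnPredicate(String columnPath, T value) {
    this.columnPath = Objects.requireNonNull(columnPath, "columnPath cannot be null");
    this.value = value; // null is allowed here, as it is for Eq and NotEq
  }

  // The string is built only when someone asks for it; no field retains it.
  @Override
  public String toString() {
    String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
    return name + "(" + columnPath + ", " + value + ")";
  }
}

final class LazyToStringDemo {
  public static void main(String[] args) {
    LazyColumnPredicate<Integer> p = new LazyColumnPredicate<>("a.b", 42);
    System.out.println(p); // rendered on demand
  }
}
```

The trade-off in this sketch is that the string is rebuilt on every call, 
which seems acceptable if toString is only used for debugging.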

> Parquet Filter predicate storing nested string causing OOM's
> ------------------------------------------------------------
>
>                 Key: PARQUET-2220
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2220
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Abhishek Jain
>            Priority: Critical
>
> Each instance of ColumnFilterPredicate eagerly stores the filter value in 
> its toString field, which is not useful:
> {code:java}
> static abstract class ColumnFilterPredicate<T extends Comparable<T>>
>     implements FilterPredicate, Serializable {
>   private final Column<T> column;
>   private final T value;
>   private final String toString;
>
>   protected ColumnFilterPredicate(Column<T> column, T value) {
>     this.column = Objects.requireNonNull(column, "column cannot be null");
>     // Eq and NotEq allow value to be null, Lt, Gt, LtEq, GtEq however do not,
>     // so they guard against null in their own constructors.
>     this.value = value;
>     String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
>     this.toString = name + "(" + column.getColumnPath().toDotString() + ", "
>         + value + ")";
>   }
> {code}
>  
>  
> If the filter predicate is long or deeply nested, this can take a lot of 
> memory while the filter is being created.
> In our production environment, we have seen this grow to as much as 4 GB 
> while opening multiple Parquet readers.
> The same pattern is repeated in BinaryLogicalFilterPredicate, where toString 
> is eagerly computed and stored, so a lot of duplication happens when 
> building And/Or filters.
> I did not find a use case for storing it so eagerly.
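
The duplication described above can be illustrated with a small sketch. 
EagerNode and EagerGrowthDemo are hypothetical classes that only mimic the 
eager pattern (every node keeps its own rendered string); they are not Parquet 
code, and the counts are characters held by the strings, not measured heap 
usage.

```java
// Hypothetical node that mimics the eager pattern: each node stores its own string.
final class EagerNode {
  final String text;

  private EagerNode(String text) {
    this.text = text;
  }

  // Leaf predicate, rendered at construction time.
  static EagerNode leaf(int value) {
    return new EagerNode("eq(c, " + value + ")");
  }

  // Logical node: copies both children's strings into a new, longer string.
  static EagerNode and(EagerNode left, EagerNode right) {
    return new EagerNode("and(" + left.text + ", " + right.text + ")");
  }
}

final class EagerGrowthDemo {
  // Builds a left-deep chain of `and` nodes over the given number of leaves
  // and sums the characters held by every node in the resulting tree.
  static long retainedChars(int leaves) {
    EagerNode current = EagerNode.leaf(0);
    long total = current.text.length();
    for (int i = 1; i < leaves; i++) {
      EagerNode leaf = EagerNode.leaf(i % 10); // single digit keeps leaf size uniform
      current = EagerNode.and(current, leaf);
      total += leaf.text.length() + current.text.length();
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println("100 leaves: " + retainedChars(100) + " chars");
    System.out.println("200 leaves: " + retainedChars(200) + " chars");
  }
}
```

In this sketch, doubling the number of leaves roughly quadruples the total, 
because every intermediate node holds a full copy of everything beneath it.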



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
