[ https://issues.apache.org/jira/browse/PARQUET-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653313#comment-17653313 ]
Gabor Szadovszky commented on PARQUET-2220:
-------------------------------------------

[~abhiSumo304], I agree that eagerly storing the toString value is not a good idea. I don't think it has a proper use case either. toString should be used for debugging purposes anyway, so eagerly storing the value does not really make sense.

Unfortunately, I no longer work on the Parquet code base actively. Feel free to put up a PR to fix this and I'll try to review it in time.

> Parquet Filter predicate storing nested string causing OOM's
> ------------------------------------------------------------
>
>                 Key: PARQUET-2220
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2220
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Abhishek Jain
>            Priority: Critical
>
> Each instance of ColumnFilterPredicate eagerly stores the filter values in its
> toString field, which is not useful.
> {code:java}
> static abstract class ColumnFilterPredicate<T extends Comparable<T>>
>     implements FilterPredicate, Serializable {
>   private final Column<T> column;
>   private final T value;
>   private final String toString;
>   protected ColumnFilterPredicate(Column<T> column, T value) {
>     this.column = Objects.requireNonNull(column, "column cannot be null");
>     // Eq and NotEq allow value to be null, Lt, Gt, LtEq, GtEq however do not,
>     // so they guard against null in their own constructors.
>     this.value = value;
>     String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
>     this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + value + ")";
>   }{code}
>
> If your filter predicate is very long or deeply nested, this can take a lot of
> memory while creating the filter.
> In our production environment we have seen this grow to as much as 4 GB while
> opening multiple Parquet readers.
> The same pattern is repeated in BinaryLogicalFilterPredicate, where toString is
> eagerly computed and stored, so a lot of duplication happens when building
> And/Or filters.
> I did not find a use case for storing it so eagerly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
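
For illustration, the lazy alternative discussed in the comment above could look roughly like the following. This is a minimal sketch under stated assumptions, not the actual Parquet change: `LazyColumnPredicate` and `LazyAnd` are hypothetical stand-ins for ColumnFilterPredicate and BinaryLogicalFilterPredicate, and the column is reduced to a plain dot-path string instead of `Column<T>`.

```java
import java.util.Locale;
import java.util.Objects;

public class LazyToStringSketch {

  /** Hypothetical stand-in for a column predicate; it keeps no cached string. */
  static class LazyColumnPredicate<T extends Comparable<T>> {
    private final String columnPath; // assumption: plain dot-path instead of Column<T>
    private final T value;

    LazyColumnPredicate(String columnPath, T value) {
      this.columnPath = Objects.requireNonNull(columnPath, "columnPath cannot be null");
      this.value = value; // may be null, as for Eq / NotEq in the quoted code
    }

    @Override
    public String toString() {
      // Built on each call; nothing is retained on the instance.
      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
      return name + "(" + columnPath + ", " + value + ")";
    }
  }

  /** Hypothetical stand-in for a binary logical predicate such as And / Or. */
  static class LazyAnd {
    private final Object left;
    private final Object right;

    LazyAnd(Object left, Object right) {
      this.left = Objects.requireNonNull(left, "left cannot be null");
      this.right = Objects.requireNonNull(right, "right cannot be null");
    }

    @Override
    public String toString() {
      // Children are rendered only when the string is requested, so nested
      // filters do not each hold a copy of their children's descriptions.
      return "and(" + left + ", " + right + ")";
    }
  }

  public static void main(String[] args) {
    LazyColumnPredicate<Integer> a = new LazyColumnPredicate<>("a.b", 1);
    LazyColumnPredicate<Integer> b = new LazyColumnPredicate<>("c", 2);
    // prints "and(lazycolumnpredicate(a.b, 1), lazycolumnpredicate(c, 2))"
    System.out.println(new LazyAnd(a, b));
  }
}
```

The trade-off in this sketch is that the string is rebuilt on every call, which seems acceptable if toString is only used for debugging, as the comment suggests.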