[
https://issues.apache.org/jira/browse/SPARK-52492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kent Yao reassigned SPARK-52492:
--------------------------------
Assignee: Hongze Zhang
> Make InMemoryRelation.convertToColumnarIfPossible customizable
> --------------------------------------------------------------
>
> Key: SPARK-52492
> URL: https://issues.apache.org/jira/browse/SPARK-52492
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Hongze Zhang
> Assignee: Hongze Zhang
> Priority: Major
>
> InMemoryRelation.convertToColumnarIfPossible is highly coupled with vanilla
> Spark's columnar processing logic. It unwraps the input columnar plan by
> removing the topmost ColumnarToRowExec, the assumes that the outcome RDD
> after this process can be recognized by the user-customized cache serializer.
>
> But sometimes this assertion is invalid. As in the Apache Gluten project, we
> may continue distiguishing plans that are all have `supportsColumnar=true`
> with different *columnar batch types.* So even the topmost
> `ColumnarToRowExec` is removed, we still don't know whether the columnar RDD
> unwrapped can be accepted by Gluten's cache serializer (assuming it only
> handles one certain type of columnar batch or something).
>
> So in Gluten we had a rule to workaround the logic in
> InMemoryRelation.convertToColumnarIfPossible:
> [https://github.com/apache/incubator-gluten/blob/c6461b4e0c7d3022a31fa832aeab588b1a3200e6/gluten-substrait/src/main/scala/org/apache/gluten/extension/columnar/MiscColumnarRules.scala#L192-L217].
> This is the best way we had thought about to get through the issue but it's
> still not elegant, especially the rule is even caller-sensitive to determine
> whether it's called in the caching planning process or not.
>
> A simple solution would be move
> `InMemoryRelation.convertToColumnarIfPossible` to as a public API of
> `CachedBatchSerializer` so that plugins like Gluten could have the relevant
> logic customized for their own catch serializers.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]