[ 
https://issues.apache.org/jira/browse/SPARK-52492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-52492:
--------------------------------

    Assignee: Hongze Zhang

> Make InMemoryRelation.convertToColumnarIfPossible customizable
> --------------------------------------------------------------
>
>                 Key: SPARK-52492
>                 URL: https://issues.apache.org/jira/browse/SPARK-52492
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Hongze Zhang
>            Assignee: Hongze Zhang
>            Priority: Major
>
> InMemoryRelation.convertToColumnarIfPossible is highly coupled with vanilla 
> Spark's columnar processing logic. It unwraps the input columnar plan by 
> removing the topmost ColumnarToRowExec, the assumes that the outcome RDD 
> after this process can be recognized by the user-customized cache serializer.
>  
> But sometimes this assertion is invalid. As in the Apache Gluten project, we 
> may continue distiguishing plans that are all have `supportsColumnar=true` 
> with different *columnar batch types.* So even the topmost 
> `ColumnarToRowExec` is removed, we still don't know whether the columnar RDD 
> unwrapped can be accepted by Gluten's cache serializer (assuming it only 
> handles one certain type of columnar batch or something).
>  
> So in Gluten we had a rule to workaround the logic in 
> InMemoryRelation.convertToColumnarIfPossible: 
> [https://github.com/apache/incubator-gluten/blob/c6461b4e0c7d3022a31fa832aeab588b1a3200e6/gluten-substrait/src/main/scala/org/apache/gluten/extension/columnar/MiscColumnarRules.scala#L192-L217].
>  This is the best way we had thought about to get through the issue but it's 
> still not elegant, especially the rule is even caller-sensitive to determine 
> whether it's called in the caching planning process or not.
>  
> A simple solution would be move 
> `InMemoryRelation.convertToColumnarIfPossible` to as a public API of 
> `CachedBatchSerializer` so that plugins like Gluten could have the relevant 
> logic customized for their own catch serializers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to