[ https://issues.apache.org/jira/browse/SPARK-49919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-49919:
-----------------------------------
    Labels: pull-request-available  (was: )

> SpecialLimits strategy doesn't work when returning the content of the Dataset as a Dataset of JSON strings
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-49919
>                 URL: https://issues.apache.org/jira/browse/SPARK-49919
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0, 3.5.3
>            Reporter: Lantao Jin
>            Priority: Major
>              Labels: pull-request-available
>
> `CollectLimitExec` is used when a logical `Limit` and/or `Offset` operation is the final operator in the plan. Compared to `GlobalLimitExec`, it avoids shuffling the data into a single output partition.
> But when the Dataset is returned as a Dataset of JSON strings, `CollectLimitExec` and `TakeOrderedAndProjectExec` cannot be applied, because the SpecialLimits strategy no longer matches the plan.
> Here is an example:
> {code:java}
> scala> spark.sql("select * from right_t limit 4").explain
> == Physical Plan ==
> CollectLimit 4
> +- Scan hive spark_catalog.default.right_t [id#23, name#24], HiveTableRelation [`spark_catalog`.`default`.`right_t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#23, name#24], Partition Cols: []]
>
> scala> spark.sql("select * from right_t limit 4").toJSON.explain
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false, true) AS value#34]
>    +- MapPartitions org.apache.spark.sql.Dataset$$Lambda/0x00000070021d8c58@5b17838a, obj#33: java.lang.String
>       +- DeserializeToObject createexternalrow(staticinvoke(class java.lang.Integer, ObjectType(class java.lang.Integer), valueOf, id#27, true, false, true), name#28.toString, StructField(id,IntegerType,true), StructField(name,StringType,true)), obj#32: org.apache.spark.sql.Row
>          +- GlobalLimit 4, 0
>             +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=42]
>                +- LocalLimit 4
>                   +- Scan hive spark_catalog.default.right_t [id#27, name#28], HiveTableRelation [`spark_catalog`.`default`.`right_t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#27, name#28], Partition Cols: []]
> {code}
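> A minimal, self-contained way to reproduce the two plans above (a sketch only; it uses a local temp view as a hypothetical stand-in for the Hive table `right_t`, so the scan node differs but the limit planning behaves the same) could look like:
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> object Spark49919Repro {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .master("local[*]")
>       .appName("SPARK-49919 repro")
>       .getOrCreate()
>     import spark.implicits._
>
>     // Hypothetical stand-in for the Hive table right_t from the example above.
>     Seq((1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"))
>       .toDF("id", "name")
>       .createOrReplaceTempView("right_t")
>
>     // Limit is the final operator: SpecialLimits plans a shuffle-free CollectLimit.
>     spark.sql("select * from right_t limit 4").explain()
>
>     // toJSON inserts SerializeFromObject/MapPartitions/DeserializeToObject above
>     // the Limit, so SpecialLimits no longer matches and the plan falls back to
>     // GlobalLimit + Exchange SinglePartition.
>     spark.sql("select * from right_t limit 4").toJSON.explain()
>
>     spark.stop()
>   }
> }
> {code}
> Comparing the two `explain` outputs side by side shows the single-partition shuffle in the second plan that `CollectLimitExec` would otherwise avoid.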