Lantao Jin created SPARK-49919:
----------------------------------

             Summary: SpecialLimits strategy doesn't work when returning the 
content of the Dataset as a Dataset of JSON strings
                 Key: SPARK-49919
                 URL: https://issues.apache.org/jira/browse/SPARK-49919
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.3, 4.0.0
            Reporter: Lantao Jin


`CollectLimitExec` is used when a logical `Limit` and/or `Offset` operation is 
the final operator. Compared to `GlobalLimitExec`, it avoids shuffling data to 
a single output partition.
But when the content of the Dataset is returned as a Dataset of JSON strings 
(via `toJSON`), `CollectLimitExec` and `TakeOrderedAndProjectExec` cannot be 
applied, because the SpecialLimits strategy does not work as expected: the plan 
falls back to `GlobalLimitExec` with a single-partition shuffle.
Here is an example:
```
scala> spark.sql("select * from right_t limit 4").explain
== Physical Plan ==
CollectLimit 4
+- Scan hive spark_catalog.default.right_t [id#23, name#24], HiveTableRelation 
[`spark_catalog`.`default`.`right_t`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#23, 
name#24], Partition Cols: []]

scala> spark.sql("select * from right_t limit 4").toJSON.explain
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, 
java.lang.String, true], true, false, true) AS value#34]
   +- MapPartitions 
org.apache.spark.sql.Dataset$$Lambda/0x00000070021d8c58@5b17838a, obj#33: 
java.lang.String
      +- DeserializeToObject createexternalrow(staticinvoke(class 
java.lang.Integer, ObjectType(class java.lang.Integer), valueOf, id#27, true, 
false, true), name#28.toString, StructField(id,IntegerType,true), 
StructField(name,StringType,true)), obj#32: org.apache.spark.sql.Row
         +- GlobalLimit 4, 0
            +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=42]
               +- LocalLimit 4
                  +- Scan hive spark_catalog.default.right_t [id#27, name#28], 
HiveTableRelation [`spark_catalog`.`default`.`right_t`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#27, 
name#28], Partition Cols: []]
```
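
The behavior above can be reproduced with a minimal sketch like the following. The table data, the `local[*]` session, and the object name `SpecialLimitsRepro` are illustrative stand-ins for the report's Hive table `right_t`, not part of the original report:

```scala
import org.apache.spark.sql.SparkSession

object SpecialLimitsRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SpecialLimitsRepro")
      .getOrCreate()
    import spark.implicits._

    // A small in-memory table stands in for right_t from the report.
    Seq((1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"))
      .toDF("id", "name")
      .createOrReplaceTempView("right_t")

    val plain = spark.sql("select * from right_t limit 4")

    // The plain query should plan a CollectLimit ...
    println(plain.queryExecution.executedPlan)

    // ... while the toJSON variant inserts SerializeFromObject /
    // MapPartitions / DeserializeToObject above the limit, so the
    // SpecialLimits strategy no longer matches and the plan falls back
    // to GlobalLimit over a single-partition Exchange.
    println(plain.toJSON.queryExecution.executedPlan)

    spark.stop()
  }
}
```

Inspecting `queryExecution.executedPlan` (or calling `.explain`) on both Datasets should show the same divergence as the plans quoted above.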



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
