Lantao Jin created SPARK-49919: ---------------------------------- Summary: SpecialLimits strategy doesn't work when return the content of the Dataset as a Dataset of JSON strings Key: SPARK-49919 URL: https://issues.apache.org/jira/browse/SPARK-49919 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.3, 4.0.0 Reporter: Lantao Jin
`CollectLimitExec` is used when a logical `Limit` and/or `Offset` operation is the final operator. Comparing to `GlobalLimitExec`, it can avoid shuffle data to a single output partition. But when the dataset is collected as a Dataset of JSON strings. The `GlobalLimitExec` and `TakeOrderedAndProjectExec` are not able to applied since the SpecialLimits strategy cannot work as expected. Here is an example: ``` scala> spark.sql("select * from right_t limit 4").explain == Physical Plan == CollectLimit 4 +- Scan hive spark_catalog.default.right_t [id#23, name#24], HiveTableRelation [`spark_catalog`.`default`.`right_t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#23, name#24], Partition Cols: []] scala> spark.sql("select * from right_t limit 4").toJSON.explain == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false, true) AS value#34] +- MapPartitions org.apache.spark.sql.Dataset$$Lambda/0x00000070021d8c58@5b17838a, obj#33: java.lang.String +- DeserializeToObject createexternalrow(staticinvoke(class java.lang.Integer, ObjectType(class java.lang.Integer), valueOf, id#27, true, false, true), name#28.toString, StructField(id,IntegerType,true), StructField(name,StringType,true)), obj#32: org.apache.spark.sql.Row +- GlobalLimit 4, 0 +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=42] +- LocalLimit 4 +- Scan hive spark_catalog.default.right_t [id#27, name#28], HiveTableRelation [`spark_catalog`.`default`.`right_t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#27, name#28], Partition Cols: []] ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org