[ https://issues.apache.org/jira/browse/SPARK-49919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-49919:
-----------------------------------
    Labels: pull-request-available  (was: )

> SpecialLimits strategy doesn't work when returning the content of the Dataset as a Dataset of JSON strings
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-49919
>                 URL: https://issues.apache.org/jira/browse/SPARK-49919
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0, 3.5.3
>            Reporter: Lantao Jin
>            Priority: Major
>              Labels: pull-request-available
>
> `CollectLimitExec` is used when a logical `Limit` and/or `Offset` operation is the final operator in the plan. Compared to `GlobalLimitExec`, it avoids shuffling the data into a single output partition.
> But when the Dataset is returned as a Dataset of JSON strings, `CollectLimitExec` and `TakeOrderedAndProjectExec` cannot be applied, because the SpecialLimits strategy no longer matches the plan.
> Here is an example:
> {code:java}
> scala> spark.sql("select * from right_t limit 4").explain
> == Physical Plan ==
> CollectLimit 4
> +- Scan hive spark_catalog.default.right_t [id#23, name#24], HiveTableRelation [`spark_catalog`.`default`.`right_t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#23, name#24], Partition Cols: []]
>
> scala> spark.sql("select * from right_t limit 4").toJSON.explain
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false, true) AS value#34]
>    +- MapPartitions org.apache.spark.sql.Dataset$$Lambda/0x00000070021d8c58@5b17838a, obj#33: java.lang.String
>       +- DeserializeToObject createexternalrow(staticinvoke(class java.lang.Integer, ObjectType(class java.lang.Integer), valueOf, id#27, true, false, true), name#28.toString, StructField(id,IntegerType,true), StructField(name,StringType,true)), obj#32: org.apache.spark.sql.Row
>          +- GlobalLimit 4, 0
>             +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=42]
>                +- LocalLimit 4
>                   +- Scan hive spark_catalog.default.right_t [id#27, name#28], HiveTableRelation [`spark_catalog`.`default`.`right_t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#27, name#28], Partition Cols: []]
> {code}
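> A minimal, self-contained way to reproduce the two plans above (a sketch only; it uses a local temp view as a hypothetical stand-in for the Hive table `right_t`, so the scan node differs but the limit planning behaves the same) could look like:
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> object Spark49919Repro {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .master("local[*]")
>       .appName("SPARK-49919 repro")
>       .getOrCreate()
>     import spark.implicits._
>
>     // Hypothetical stand-in for the Hive table right_t from the example above.
>     Seq((1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"))
>       .toDF("id", "name")
>       .createOrReplaceTempView("right_t")
>
>     // Limit is the final operator: SpecialLimits plans a shuffle-free CollectLimit.
>     spark.sql("select * from right_t limit 4").explain()
>
>     // toJSON inserts SerializeFromObject/MapPartitions/DeserializeToObject above
>     // the Limit, so SpecialLimits no longer matches and the plan falls back to
>     // GlobalLimit + Exchange SinglePartition.
>     spark.sql("select * from right_t limit 4").toJSON.explain()
>
>     spark.stop()
>   }
> }
> {code}
> Comparing the two `explain` outputs side by side shows the single-partition shuffle in the second plan that `CollectLimitExec` would otherwise avoid.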