davidm-db commented on code in PR #47403:
URL: https://github.com/apache/spark/pull/47403#discussion_r1687861052
##########
sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##########
@@ -650,14 +657,27 @@ class SparkSession private(
   private[sql] def sql(sqlText: String, args: Array[_], tracker: QueryPlanningTracker): DataFrame =
     withActive {
       val plan = tracker.measurePhase(QueryPlanningTracker.PARSING) {
-        val parsedPlan = sessionState.sqlParser.parsePlan(sqlText)
-        if (args.nonEmpty) {
-          PosParameterizedQuery(parsedPlan, args.map(lit(_).expr).toImmutableArraySeq)
-        } else {
-          parsedPlan
+        val parsedPlan = sessionState.sqlParser.parseScript(sqlText)
+        parsedPlan match {
+          case CompoundBody(Seq(singleStmtPlan: SingleStatement), label) if args.nonEmpty =>
+            CompoundBody(Seq(SingleStatement(
+              PosParameterizedQuery(
+                singleStmtPlan.parsedPlan, args.map(lit(_).expr).toImmutableArraySeq))), label)
+          case p =>
+            assert(args.isEmpty, "Named parameters are not supported for batch queries")
+            p
         }
       }
-      Dataset.ofRows(self, plan, tracker)
+
+      plan match {
+        case CompoundBody(Seq(singleStmtPlan: SingleStatement), _) =>
+          Dataset.ofRows(self, singleStmtPlan.parsedPlan, tracker)
+        case _ =>
+          // execute the plan directly if it is not a single statement
+          val lastRow = executeScript(plan).foldLeft(Array.empty[Row])((_, next) => next)
+          val attributes = DataTypeUtils.toAttributes(lastRow.head.schema)
+          Dataset.ofRows(self, LocalRelation.fromExternalRows(attributes, lastRow.toIndexedSeq))

Review Comment:
   This is a bit of a hard topic at the moment... We decided on this approach for the preview for multiple reasons:
   - This is what we will do for the initial version of the changes for Spark Connect execution.
   - Discussions around multiple topics are still open:
     - multiple results API: decisions around this will affect the single results API as well
     - multiple results API: from a correctness perspective, all statements (including SELECTs) need to be executed eagerly.
   It makes sense, then, to have the same behavior with the single results API as well: in that case all statements are executed eagerly, but the results of all of them except the last one are dropped.
   - There is still an open discussion on whether to include a `return` statement.
   - A ton of questions about stored procedures are still an open topic.

   What will probably happen down the line is:
   - The sql() API remains unchanged and only the last DataFrame is returned (as you suggested). This still requires a lot of work to support Connect execution, whereas the current approach already works with Connect.
   - [optional] A new API to do what we are doing at the moment.
   - A new API for multiple results, stored procedures, execute immediate, etc.

   Since the last part is still an open question, we figured we would do the simplest thing that works e2e in all cases. Then, after we gather initial feedback from the preview and better understand what we want to do for stored procedures/multiple results, we can commit to implementing all of the API changes.

   Please let us know your thoughts on this.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
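The "execute everything eagerly, keep only the last result" behavior discussed above hinges on the `foldLeft` in the diff: consuming the results iterator forces every statement to run, while the accumulator discards all but the final result. A minimal standalone Scala sketch of that pattern, with hypothetical stand-ins (a canned `executeScript()` returning `Array[Int]` in place of Spark's lazy `Array[Row]` iterator):

```scala
object LastResultSketch {
  // Stand-in for executeScript(plan): each element models one statement's
  // result, produced lazily by the iterator.
  def executeScript(): Iterator[Array[Int]] =
    Iterator(Array(1), Array(2, 3), Array(42))

  // foldLeft pulls every element from the iterator (so every statement
  // "executes" eagerly), but the accumulator ignores all results except
  // the one from the final statement.
  def lastResult(): Array[Int] =
    executeScript().foldLeft(Array.empty[Int])((_, next) => next)

  def main(args: Array[String]): Unit =
    println(lastResult().mkString(", ")) // prints "42"
}
```

Note that the initial accumulator `Array.empty` is what a script with zero statements would yield, which is why the real code then assumes `lastRow.head` exists when it builds the `LocalRelation`.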