Ivan Tsukanov created SPARK-32758:
-------------------------------------

             Summary: Spark ignores limit(1) and starts tasks for all partitions
                 Key: SPARK-32758
                 URL: https://issues.apache.org/jira/browse/SPARK-32758
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.0
         Environment: must
            Reporter: Ivan Tsukanov

If we run the following code

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}

val sparkConf = new SparkConf()
  .setAppName("test-app")
  .setMaster("local[1]")
val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

import sparkSession.implicits._

// 100000 rows spread over 1000 partitions
val df = (1 to 100000)
  .toDF("c1")
  .repartition(1000)

implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)

// Scenario 1: limit before map
df.limit(1)
  .map(identity)
  .collect()

// Scenario 2: limit after map
df.map(identity)
  .limit(1)
  .collect()

// Keeps the application running (presumably so the Spark UI can be inspected)
Thread.sleep(100000)
{code}

we will see that Spark started 1002 tasks even though the query contains limit(1):

!image-2020-09-01-10-34-47-580.png!

Expected behavior: both scenarios (limit before and after map) behave the same way, and one or two tasks are enough to get one value from the DataFrame.
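
One possible way to narrow this down is to compare the physical plans of the two scenarios instead of counting tasks in the UI. The snippet below is only a sketch: the object `LimitPlanComparison` and the method `physicalPlans` are hypothetical names introduced here, and it assumes Spark 2.4.x, where `RowEncoder(schema)` returns an `ExpressionEncoder[Row]` as in the code above. It builds both queries and prints their plans without calling an action on them.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}

// Hypothetical helper, not part of the original report.
object LimitPlanComparison {

  // Returns the physical plans of (limit-then-map, map-then-limit) as strings.
  // The queries are only planned here; no action such as collect() is called.
  def physicalPlans(spark: SparkSession): (String, String) = {
    import spark.implicits._

    // Same input as in the report: 100000 rows in 1000 partitions.
    val df = (1 to 100000).toDF("c1").repartition(1000)

    // Assumes Spark 2.4.x, where RowEncoder.apply returns ExpressionEncoder[Row].
    implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)

    val limitThenMap = df.limit(1).map(identity)
    val mapThenLimit = df.map(identity).limit(1)

    (limitThenMap.queryExecution.executedPlan.toString,
     mapThenLimit.queryExecution.executedPlan.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("limit-plan-comparison")
      .master("local[1]")
      .getOrCreate()
    try {
      val (limitFirst, mapFirst) = physicalPlans(spark)
      println("limit(1) before map:\n" + limitFirst)
      println("limit(1) after map:\n" + mapFirst)
    } finally {
      spark.stop()
    }
  }
}
```

If the first plan shows a local and a global limit separated by an exchange while the second shows a single collect-style limit at the root, that would be consistent with the task counts described above. This is an expectation, not something observed in this report.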