Hi! It is because of "spark.sql.shuffle.partitions": its default value is 200, and you can see that 200 in the physical plan at the rangepartitioning step:
scala> val df = sc.parallelize(1 to 1000, 10).toDF("v").sort("v")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [v: int]

scala> df.explain()
== Physical Plan ==
*(2) Sort [v#300 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(v#300 ASC NULLS FIRST, 200), true, [id=#334]
   +- *(1) Project [value#297 AS v#300]
      +- *(1) SerializeFromObject [input[0, int, false] AS value#297]
         +- Scan[obj#296]

scala> df.rdd.getNumPartitions
res13: Int = 200

Best Regards,
Attila
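P.S. If you want fewer output partitions, you can lower the setting before running the sort (a sketch from a spark-shell session; the value 10 is just an example, and this assumes a Spark 2.x shell without adaptive query execution, which would coalesce partitions on its own):

scala> // lower the shuffle parallelism for subsequent shuffles
scala> spark.conf.set("spark.sql.shuffle.partitions", "10")

scala> val df2 = sc.parallelize(1 to 1000, 10).toDF("v").sort("v")
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [v: int]

scala> df2.rdd.getNumPartitions
res14: Int = 10

Alternatively, df.coalesce(n) after the sort reduces the partition count without another shuffle.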