(Moving to the user list.)

Hi Kannan,
You need to set |mapred.map.tasks| to 1 in hive-site.xml. The reason is this line of code <https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L68>, which uses |mapred.map.tasks| to decide how many splits the table is read in and so overrides |spark.default.parallelism|. Also, |spark.sql.shuffle.partitions| isn’t used here since there’s no shuffle involved (we only need to sort within a partition).
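For reference, a minimal sketch of the corresponding hive-site.xml entry, assuming the usual Hadoop-style property layout; it goes inside the existing <configuration> element of your file:

```xml
<!-- Read the table as a single split so that "sort by" sees one partition -->
<property>
  <name>mapred.map.tasks</name>
  <value>1</value>
</property>
```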
The default value of |mapred.map.tasks| is 2 <https://hadoop.apache.org/docs/r1.0.4/mapred-default.html>, so the table is read as two partitions and each one is sorted independently. That is why your Spark SQL result splits down the middle into two halves, each sorted by age on its own.
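To illustrate, here is a small plain-Python simulation (not Spark code) of sorting within each partition. It assumes the eight rows are split evenly into contiguous halves; real split boundaries depend on the input file and input format. The helper name |sort_within_splits| is made up for this sketch.

```python
# Rows from the testSortBy table, in file order: (name, age)
rows = [
    ("Aditya", 28), ("aash", 25), ("prashanth", 27), ("bharath", 26),
    ("terry", 27), ("nanda", 26), ("pradeep", 27), ("pratyay", 26),
]

def sort_within_splits(rows, num_splits):
    """Cut rows into contiguous splits and sort each split by age on its own.

    Hypothetical helper: it mimics "sort by" without a shuffle, where every
    partition is sorted independently and the results are concatenated.
    """
    size = -(-len(rows) // num_splits)  # ceiling division
    splits = [rows[i:i + size] for i in range(0, len(rows), size)]
    result = []
    for split in splits:
        # sorted() is stable, so rows with equal age keep their file order
        result.extend(sorted(split, key=lambda row: row[1]))
    return result

two_splits = sort_within_splits(rows, 2)  # like mapred.map.tasks = 2
one_split = sort_within_splits(rows, 1)   # like mapred.map.tasks = 1

for name, age in two_splits:
    print(name, age)
```

With two splits this reproduces the order you got from Spark SQL; with one split the result is fully sorted by age, like the Hive output.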
Cheng

On 2/19/15 10:33 AM, Kannan Rajah wrote:
According to the Hive documentation, "sort by" is supposed to order the results for each reducer. So if we set a single reducer, then the results should be sorted, right? But this is not happening. Any idea why? It looks like the settings I am using to restrict the number of reducers are not having an effect.

*Tried the following:*

Set spark.default.parallelism to 1
Set spark.sql.shuffle.partitions to 1

These were set in hive-site.xml and also inside spark shell.

*Spark-SQL*

create table if not exists testSortBy (key int, name string, age int);
LOAD DATA LOCAL INPATH '/home/mapr/sample-name-age.txt' OVERWRITE INTO TABLE testSortBy;
select * from testSortBY;

1 Aditya 28
2 aash 25
3 prashanth 27
4 bharath 26
5 terry 27
6 nanda 26
7 pradeep 27
8 pratyay 26

set spark.default.parallelism=1;
set spark.sql.shuffle.partitions=1;
select name,age from testSortBy sort by age;

aash 25
bharath 26
prashanth 27
Aditya 28
nanda 26
pratyay 26
terry 27
pradeep 27

*HIVE*

select name,age from testSortBy sort by age;

aash 25
bharath 26
nanda 26
pratyay 26
prashanth 27
terry 27
pradeep 27
Aditya 28

--
Kannan