Re: Spark-SQL 1.2.0 "sort by" results are not consistent with Hive

Cheng Lian Mon, 23 Feb 2015 18:39:58 -0800

(Move to user list.)

Hi Kannan,

You need to set |mapred.map.tasks| to 1 in hive-site.xml. The reason is this line of code <https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L68>, which overrides |spark.default.parallelism|. Also, |spark.sql.shuffle.partitions| isn’t used here since there’s no shuffle involved (we only need to sort within a partition).
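
For reference, a minimal sketch of what that hive-site.xml entry could look like, assuming the usual Hadoop-style property layout. It goes inside the file's existing <configuration> element; only the property name and value come from this thread.

```xml
<!-- Read the table as a single input partition so "sort by" yields one sorted run -->
<property>
  <name>mapred.map.tasks</name>
  <value>1</value>
</property>
```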

The default value of |mapred.map.tasks| is 2 <https://hadoop.apache.org/docs/r1.0.4/mapred-default.html>. You can see that the Spark SQL result splits down the middle into two separately sorted parts, one per input partition.
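
To illustrate the effect, here is a small pure-Python simulation (not Spark code). It assumes the eight sample rows are split into contiguous, equal-sized partitions, which is a simplification: real input splits are computed from byte ranges of the file. The function name sort_within_partitions is made up for this sketch. Each partition is sorted by age on its own and the partitions are then concatenated, which is what "sort by" does when there is no shuffle.

```python
# Rows from the sample table in this thread, in file order: (name, age).
rows = [
    ("Aditya", 28), ("aash", 25), ("prashanth", 27), ("bharath", 26),
    ("terry", 27), ("nanda", 26), ("pradeep", 27), ("pratyay", 26),
]


def sort_within_partitions(rows, num_partitions):
    """Split rows into contiguous chunks, sort each chunk by age, concatenate.

    Hypothetical helper: models per-partition sorting only, with no
    global ordering across partitions.
    """
    size = -(-len(rows) // num_partitions)  # ceiling division
    parts = [rows[i:i + size] for i in range(0, len(rows), size)]
    # sorted() is stable, so rows with equal age keep their input order.
    return [row for part in parts for row in sorted(part, key=lambda r: r[1])]


two_parts = sort_within_partitions(rows, 2)  # models mapred.map.tasks = 2
one_part = sort_within_partitions(rows, 1)   # models mapred.map.tasks = 1

for name, age in two_parts:
    print(name, age)
```

With two partitions the simulation gives two sorted runs, as in the Spark SQL output quoted below; with one partition it gives a single sorted run, as in the Hive output. The order of rows with equal age depends on Python's stable sort here and is not something either engine guarantees.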


Cheng

On 2/19/15 10:33 AM, Kannan Rajah wrote:

According to the Hive documentation, "sort by" is supposed to order the results
for each reducer. So if we set a single reducer, then the results should be
fully sorted, right? But this is not happening. Any idea why? It looks like the
settings I am using to restrict the number of reducers are not having an
effect.

*Tried the following:*

Set spark.default.parallelism to 1

Set spark.sql.shuffle.partitions to 1

These were set in hive-site.xml and also inside spark shell.


*Spark-SQL*

create table if not exists testSortBy (key int, name string, age int);
LOAD DATA LOCAL INPATH '/home/mapr/sample-name-age.txt' OVERWRITE INTO TABLE
testSortBy;
select * from testSortBY;

1    Aditya    28
2    aash    25
3    prashanth    27
4    bharath    26
5    terry    27
6    nanda    26
7    pradeep    27
8    pratyay    26


set spark.default.parallelism=1;

set spark.sql.shuffle.partitions=1;

select name,age from testSortBy sort by age;

aash    25
bharath    26
prashanth    27
Aditya    28
nanda    26
pratyay    26
terry    27
pradeep    27


*HIVE*

select name,age from testSortBy sort by age;

aash    25
bharath    26
nanda    26
pratyay    26
prashanth    27
terry    27
pradeep    27
Aditya    28


--
Kannan
