How many reducers did you set for Hive? With a small data set, Hive will run in local
mode, which always sets the reducer count to 1.
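
For reference, the effective reducer setting can be checked (and, if needed, forced) from the Hive CLI; a minimal sketch:

  SET mapred.reduce.tasks;      -- prints the current value (-1 lets Hive decide)
  SET mapred.reduce.tasks=1;    -- force a single reducer for this session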

From: Kannan Rajah [mailto:kra...@maprtech.com]
Sent: Thursday, February 26, 2015 3:02 AM
To: Cheng Lian
Cc: user@spark.apache.org
Subject: Re: Spark-SQL 1.2.0 "sort by" results are not consistent with Hive

Cheng, we tried this setting and it still did not help. This was on Spark 1.2.0.


--
Kannan

On Mon, Feb 23, 2015 at 6:38 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

(Moving to the user list.)

Hi Kannan,

You need to set mapred.map.tasks to 1 in hive-site.xml. The reason is this line of
code <https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L68>,
which overrides spark.default.parallelism. Also, spark.sql.shuffle.partitions isn't
used here, since there's no shuffle involved (we only need to sort within each partition).

The default value of mapred.map.tasks is
2 <https://hadoop.apache.org/docs/r1.0.4/mapred-default.html>, which is why the
Spark SQL result comes back as two sorted parts, split in the middle.
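
For reference, the override Cheng describes would go into hive-site.xml roughly as follows (a sketch only; the surrounding <configuration> element and the file's location depend on your installation):

  <property>
    <name>mapred.map.tasks</name>
    <value>1</value>
  </property>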

Cheng

On 2/19/15 10:33 AM, Kannan Rajah wrote:

According to the Hive documentation, "sort by" is supposed to order the results
for each reducer. So if we set a single reducer, then the results should be
sorted, right? But this is not happening. Any idea why? It looks like the
settings I am using to restrict the number of reducers are not having any effect.

*Tried the following:*

Set spark.default.parallelism to 1
Set spark.sql.shuffle.partitions to 1

These were set in hive-site.xml and also inside the Spark shell.

*Spark-SQL*

create table if not exists testSortBy (key int, name string, age int);
LOAD DATA LOCAL INPATH '/home/mapr/sample-name-age.txt' OVERWRITE INTO TABLE testSortBy;
select * from testSortBY;

1    Aditya    28
2    aash    25
3    prashanth    27
4    bharath    26
5    terry    27
6    nanda    26
7    pradeep    27
8    pratyay    26

set spark.default.parallelism=1;
set spark.sql.shuffle.partitions=1;

select name,age from testSortBy sort by age;

aash    25
bharath    26
prashanth    27
Aditya    28
nanda    26
pratyay    26
terry    27
pradeep    27

*HIVE*

select name,age from testSortBy sort by age;

aash    25

bharath    26

nanda    26

pratyay    26

prashanth    27

terry    27

pradeep    27

Aditya    28





--

Kannan


