Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-26 Thread Cheng Lian
Could you check the Spark web UI for the number of tasks issued when the 
query is executed? I dug up |mapred.map.tasks| because I saw that 2 tasks 
were issued.
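
A quick way to confirm the same thing from the Spark 1.2 spark-shell, sketched under the assumption that |sqlContext| is a HiveContext and the testSortBy table from the quoted messages below exists (in 1.2, sql() returns a SchemaRDD, so its partition count equals the number of tasks the sort runs in):

    // Count the partitions (= tasks) behind the SORT BY query.
    val sorted = sqlContext.sql("SELECT name, age FROM testSortBy SORT BY age")
    // A value of 2 here would match the two sorted halves described below.
    println(sorted.partitions.length)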


On 2/26/15 3:01 AM, Kannan Rajah wrote:

Cheng, We tried this setting and it still did not help. This was on 
Spark 1.2.0.



--
Kannan

On Mon, Feb 23, 2015 at 6:38 PM, Cheng Lian <lian.cs@gmail.com> wrote:


(Move to user list.)

Hi Kannan,

You need to set |mapred.map.tasks| to 1 in hive-site.xml. The
reason is this line of code
(https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L68),
which overrides |spark.default.parallelism|. Also,
|spark.sql.shuffle.partitions| isn’t used here since there’s no
shuffle involved (we only need to sort within a partition).

Default value of |mapred.map.tasks| is 2
(https://hadoop.apache.org/docs/r1.0.4/mapred-default.html). You
may see that the Spark SQL result can be divided into two sorted
parts from the middle.
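
As a sketch only, the same setting could also be attempted from the 1.2 spark-shell rather than hive-site.xml; whether an in-session SET is picked up before the Hive table reader computes its splits is an assumption that this thread does not verify:

    // Two ways to try mapred.map.tasks=1 from the shell
    // (Spark 1.2, `sqlContext` a HiveContext); hive-site.xml is the suggested route.
    sqlContext.sql("SET mapred.map.tasks=1")
    sqlContext.setConf("mapred.map.tasks", "1")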

Cheng

On 2/19/15 10:33 AM, Kannan Rajah wrote:


According to the Hive documentation, sort by is supposed to order the results
for each reducer. So if we set a single reducer, then the results should be
sorted, right? But this is not happening. Any idea why? It looks like the
settings I am using to restrict the number of reducers are not having an
effect.

*Tried the following:*

Set spark.default.parallelism to 1

Set spark.sql.shuffle.partitions to 1

These were set in hive-site.xml and also inside spark shell.


*Spark-SQL*

create table if not exists testSortBy (key int, name string, age int);
LOAD DATA LOCAL INPATH '/home/mapr/sample-name-age.txt' OVERWRITE INTO TABLE
testSortBy;
select * from testSortBY;

1  Aditya     28
2  aash       25
3  prashanth  27
4  bharath    26
5  terry      27
6  nanda      26
7  pradeep    27
8  pratyay    26


set spark.default.parallelism=1;

set spark.sql.shuffle.partitions=1;

select name,age from testSortBy sort by age;

aash       25
bharath    26
prashanth  27
Aditya     28
nanda      26
pratyay    26
terry      27
pradeep    27

*HIVE*

select name,age from testSortBy sort by age;

aash       25
bharath    26
nanda      26
pratyay    26
prashanth  27
terry      27
pradeep    27
Aditya     28
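
The Hive output above is totally ordered. A sketch of a cross-check in Spark SQL, under the same spark-shell assumptions as the earlier snippets and not something tried in this thread: ORDER BY sorts across all partitions, unlike sort by which only sorts within each task, so it should always come back fully sorted regardless of the task count.

    // Global sort for comparison with the per-partition SORT BY output.
    sqlContext.sql("SELECT name, age FROM testSortBy ORDER BY age")
      .collect()
      .foreach(println)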


--
Kannan







Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-25 Thread Kannan Rajah
Cheng, We tried this setting and it still did not help. This was on Spark
1.2.0.


--
Kannan




RE: Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-25 Thread Cheng, Hao
How many reducers did you set for Hive? With a small data set, Hive will run in 
local mode, which always sets the reducer count to 1.
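
A rough proxy for that check from the Spark SQL shell, sketched with the same 1.2 HiveContext assumed earlier (the authoritative values for the Hive run itself come from running SET in the Hive CLI, and the property names may differ by Hive version):

    // Print the reducer-related keys as Spark's HiveContext sees them.
    Seq("mapred.reduce.tasks", "hive.exec.reducers.max", "hive.exec.mode.local.auto")
      .foreach(key => sqlContext.sql(s"SET $key").collect().foreach(println))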

From: Kannan Rajah [mailto:kra...@maprtech.com]
Sent: Thursday, February 26, 2015 3:02 AM
To: Cheng Lian
Cc: user@spark.apache.org
Subject: Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive

Cheng, We tried this setting and it still did not help. This was on Spark 1.2.0.


--
Kannan




Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-23 Thread Cheng Lian

(Move to user list.)

Hi Kannan,

You need to set |mapred.map.tasks| to 1 in hive-site.xml. The reason is 
this line of code 
(https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L68), 
which overrides |spark.default.parallelism|. Also, 
|spark.sql.shuffle.partitions| isn’t used here since there’s no shuffle 
involved (we only need to sort within a partition).


Default value of |mapred.map.tasks| is 2 
(https://hadoop.apache.org/docs/r1.0.4/mapred-default.html). You may see 
that the Spark SQL result can be divided into two sorted parts from the 
middle.
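
To see the per-task sorting directly, a sketch under the same assumptions (Spark 1.2 spark-shell, |sqlContext| a HiveContext, the testSortBy table below) that prints each partition's ages; with |mapred.map.tasks| at its default of 2, each half should be sorted on its own even though the whole result is not:

    // Group the SORT BY output by partition and check each one is sorted.
    val perPartition = sqlContext
      .sql("SELECT name, age FROM testSortBy SORT BY age")
      .mapPartitionsWithIndex { (i, rows) => Iterator((i, rows.map(_.getInt(1)).toSeq)) }
      .collect()
    perPartition.foreach { case (i, ages) =>
      println(s"partition $i: $ages sorted=${ages == ages.sorted}")
    }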


Cheng

On 2/19/15 10:33 AM, Kannan Rajah wrote:


According to the Hive documentation, sort by is supposed to order the results
for each reducer. So if we set a single reducer, then the results should be
sorted, right? But this is not happening. Any idea why? It looks like the
settings I am using to restrict the number of reducers are not having an
effect.

*Tried the following:*

Set spark.default.parallelism to 1

Set spark.sql.shuffle.partitions to 1

These were set in hive-site.xml and also inside spark shell.


*Spark-SQL*

create table if not exists testSortBy (key int, name string, age int);
LOAD DATA LOCAL INPATH '/home/mapr/sample-name-age.txt' OVERWRITE INTO TABLE
testSortBy;
select * from testSortBY;

1  Aditya     28
2  aash       25
3  prashanth  27
4  bharath    26
5  terry      27
6  nanda      26
7  pradeep    27
8  pratyay    26


set spark.default.parallelism=1;

set spark.sql.shuffle.partitions=1;

select name,age from testSortBy sort by age;

aash       25
bharath    26
prashanth  27
Aditya     28
nanda      26
pratyay    26
terry      27
pradeep    27

*HIVE*

select name,age from testSortBy sort by age;

aash       25
bharath    26
nanda      26
pratyay    26
prashanth  27
terry      27
pradeep    27
Aditya     28


--
Kannan

