Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-26 Thread Cheng Lian
Could you check the Spark web UI for the number of tasks issued when the 
query is executed? I dug up |mapred.map.tasks| because I saw that 2 tasks 
were issued.
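
A quick way to confirm the same thing from the Spark 1.2 spark-shell, sketched under the assumption that |sqlContext| is a HiveContext and the testSortBy table from the quoted messages below exists (in 1.2, sql() returns a SchemaRDD, so its partition count equals the number of tasks the sort runs in):

    // Count the partitions (= tasks) behind the SORT BY query.
    val sorted = sqlContext.sql("SELECT name, age FROM testSortBy SORT BY age")
    // A value of 2 here would match the two sorted halves described below.
    println(sorted.partitions.length)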


On 2/26/15 3:01 AM, Kannan Rajah wrote:

Cheng, We tried this setting and it still did not help. This was on 
Spark 1.2.0.



--
Kannan

On Mon, Feb 23, 2015 at 6:38 PM, Cheng Lian <lian.cs@gmail.com> wrote:


(Move to user list.)

Hi Kannan,

You need to set |mapred.map.tasks| to 1 in hive-site.xml. The
reason is this line of code
(https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L68),
which overrides |spark.default.parallelism|. Also,
|spark.sql.shuffle.partitions| isn’t used here since there’s no
shuffle involved (we only need to sort within a partition).

Default value of |mapred.map.tasks| is 2
(https://hadoop.apache.org/docs/r1.0.4/mapred-default.html). You
may see that the Spark SQL result can be divided into two sorted
parts from the middle.
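
As a sketch only, the same setting could also be attempted from the 1.2 spark-shell rather than hive-site.xml; whether an in-session SET is picked up before the Hive table reader computes its splits is an assumption that this thread does not verify:

    // Two ways to try mapred.map.tasks=1 from the shell
    // (Spark 1.2, `sqlContext` a HiveContext); hive-site.xml is the suggested route.
    sqlContext.sql("SET mapred.map.tasks=1")
    sqlContext.setConf("mapred.map.tasks", "1")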

Cheng

On 2/19/15 10:33 AM, Kannan Rajah wrote:


According to the Hive documentation, sort by is supposed to order the results
for each reducer. So if we set a single reducer, then the results should be
sorted, right? But this is not happening. Any idea why? It looks like the
settings I am using to restrict the number of reducers are not having an
effect.

*Tried the following:*

Set spark.default.parallelism to 1

Set spark.sql.shuffle.partitions to 1

These were set in hive-site.xml and also inside spark shell.


*Spark-SQL*

create table if not exists testSortBy (key int, name string, age int);
LOAD DATA LOCAL INPATH '/home/mapr/sample-name-age.txt' OVERWRITE INTO TABLE
testSortBy;
select * from testSortBY;

1  Aditya     28
2  aash       25
3  prashanth  27
4  bharath    26
5  terry      27
6  nanda      26
7  pradeep    27
8  pratyay    26


set spark.default.parallelism=1;

set spark.sql.shuffle.partitions=1;

select name,age from testSortBy sort by age;

aash       25
bharath    26
prashanth  27
Aditya     28
nanda      26
pratyay    26
terry      27
pradeep    27

*HIVE*

select name,age from testSortBy sort by age;

aash       25
bharath    26
nanda      26
pratyay    26
prashanth  27
terry      27
pradeep    27
Aditya     28
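
The Hive output above is totally ordered. A sketch of a cross-check in Spark SQL, under the same spark-shell assumptions as the earlier snippets and not something tried in this thread: ORDER BY sorts across all partitions, unlike sort by which only sorts within each task, so it should always come back fully sorted regardless of the task count.

    // Global sort for comparison with the per-partition SORT BY output.
    sqlContext.sql("SELECT name, age FROM testSortBy ORDER BY age")
      .collect()
      .foreach(println)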


--
Kannan







Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-25 Thread Kannan Rajah
Cheng, We tried this setting and it still did not help. This was on Spark
1.2.0.


--
Kannan




RE: Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-25 Thread Cheng, Hao
How many reducers did you set for Hive? With a small data set, Hive will run in 
local mode, which always sets the reducer count to 1.
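
A rough proxy for that check from the Spark SQL shell, sketched with the same 1.2 HiveContext assumed earlier (the authoritative values for the Hive run itself come from running SET in the Hive CLI, and the property names may differ by Hive version):

    // Print the reducer-related keys as Spark's HiveContext sees them.
    Seq("mapred.reduce.tasks", "hive.exec.reducers.max", "hive.exec.mode.local.auto")
      .foreach(key => sqlContext.sql(s"SET $key").collect().foreach(println))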

From: Kannan Rajah [mailto:kra...@maprtech.com]
Sent: Thursday, February 26, 2015 3:02 AM
To: Cheng Lian
Cc: user@spark.apache.org
Subject: Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive

Cheng, We tried this setting and it still did not help. This was on Spark 1.2.0.


--
Kannan




Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-23 Thread Cheng Lian

(Move to user list.)

Hi Kannan,

You need to set |mapred.map.tasks| to 1 in hive-site.xml. The reason is 
this line of code 
(https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L68), 
which overrides |spark.default.parallelism|. Also, 
|spark.sql.shuffle.partitions| isn’t used here since there’s no shuffle 
involved (we only need to sort within a partition).


Default value of |mapred.map.tasks| is 2 
(https://hadoop.apache.org/docs/r1.0.4/mapred-default.html). You may see 
that the Spark SQL result can be divided into two sorted parts from the 
middle.
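
To see the per-task sorting directly, a sketch under the same assumptions (Spark 1.2 spark-shell, |sqlContext| a HiveContext, the testSortBy table below) that prints each partition's ages; with |mapred.map.tasks| at its default of 2, each half should be sorted on its own even though the whole result is not:

    // Group the SORT BY output by partition and check each one is sorted.
    val perPartition = sqlContext
      .sql("SELECT name, age FROM testSortBy SORT BY age")
      .mapPartitionsWithIndex { (i, rows) => Iterator((i, rows.map(_.getInt(1)).toSeq)) }
      .collect()
    perPartition.foreach { case (i, ages) =>
      println(s"partition $i: $ages sorted=${ages == ages.sorted}")
    }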


Cheng

On 2/19/15 10:33 AM, Kannan Rajah wrote:


According to the Hive documentation, sort by is supposed to order the results
for each reducer. So if we set a single reducer, then the results should be
sorted, right? But this is not happening. Any idea why? It looks like the
settings I am using to restrict the number of reducers are not having an
effect.

*Tried the following:*

Set spark.default.parallelism to 1

Set spark.sql.shuffle.partitions to 1

These were set in hive-site.xml and also inside spark shell.


*Spark-SQL*

create table if not exists testSortBy (key int, name string, age int);
LOAD DATA LOCAL INPATH '/home/mapr/sample-name-age.txt' OVERWRITE INTO TABLE
testSortBy;
select * from testSortBY;

1  Aditya     28
2  aash       25
3  prashanth  27
4  bharath    26
5  terry      27
6  nanda      26
7  pradeep    27
8  pratyay    26


set spark.default.parallelism=1;

set spark.sql.shuffle.partitions=1;

select name,age from testSortBy sort by age;

aash       25
bharath    26
prashanth  27
Aditya     28
nanda      26
pratyay    26
terry      27
pradeep    27

*HIVE*

select name,age from testSortBy sort by age;

aash       25
bharath    26
nanda      26
pratyay    26
prashanth  27
terry      27
pradeep    27
Aditya     28


--
Kannan

