Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive
Could you check the Spark web UI for the number of tasks issued when the query is executed? I dug out |mapred.map.tasks| because I saw that 2 tasks were issued.
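As a quick cross-check alongside the web UI, the output itself hints at how many tasks produced it. The sketch below is not a Spark API; count_sorted_runs is a hypothetical helper that counts non-decreasing runs in the age column, using the Spark SQL and Hive results reported in this thread. Since each partition is sorted independently, the run count is a lower bound on the number of non-empty partitions.

```python
def count_sorted_runs(values):
    """Count maximal non-decreasing runs in a sequence (hypothetical helper)."""
    if not values:
        return 0
    runs = 1
    for prev, cur in zip(values, values[1:]):
        if cur < prev:  # a drop means a new sorted run starts here
            runs += 1
    return runs


# Age column of the results reported in the thread.
spark_ages = [25, 26, 27, 28, 26, 26, 27, 27]
hive_ages = [25, 26, 26, 26, 27, 27, 27, 28]

spark_runs = count_sorted_runs(spark_ages)
hive_runs = count_sorted_runs(hive_ages)
print(spark_runs, hive_runs)
```

Two runs in the Spark SQL output is consistent with the 2 tasks seen in the web UI; a single run in the Hive output is consistent with a single reducer.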
Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive
Cheng, We tried this setting and it still did not help. This was on Spark 1.2.0. -- Kannan
RE: Spark-SQL 1.2.0 sort by results are not consistent with Hive
How many reducers did you set for Hive? With a small data set, Hive will run in local mode, which always sets the reducer count to 1.
Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive
(Move to user list.)

Hi Kannan,

You need to set |mapred.map.tasks| to 1 in hive-site.xml. The reason is this line of code (https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L68), which overrides |spark.default.parallelism|. Also, |spark.sql.shuffle.partitions| isn’t used here since there’s no shuffle involved (we only need to sort within a partition).

The default value of |mapred.map.tasks| is 2 (https://hadoop.apache.org/docs/r1.0.4/mapred-default.html). You may see that the Spark SQL result can be divided into two sorted parts from the middle.

Cheng

On 2/19/15 10:33 AM, Kannan Rajah wrote:

According to the Hive documentation, sort by is supposed to order the results within each reducer. So if we set a single reducer, then the results should be fully sorted, right? But this is not happening. Any idea why? It looks like the settings I am using to restrict the number of reducers are not having an effect.

*Tried the following:*

Set spark.default.parallelism to 1
Set spark.sql.shuffle.partitions to 1

These were set in hive-site.xml and also inside the Spark shell.

*Spark-SQL*

create table if not exists testSortBy (key int, name string, age int);
LOAD DATA LOCAL INPATH '/home/mapr/sample-name-age.txt' OVERWRITE INTO TABLE testSortBy;
select * from testSortBy;

1 Aditya 28
2 aash 25
3 prashanth 27
4 bharath 26
5 terry 27
6 nanda 26
7 pradeep 27
8 pratyay 26

set spark.default.parallelism=1;
set spark.sql.shuffle.partitions=1;
select name,age from testSortBy sort by age;

aash 25
bharath 26
prashanth 27
Aditya 28
nanda 26
pratyay 26
terry 27
pradeep 27

*HIVE*

select name,age from testSortBy sort by age;

aash 25
bharath 26
nanda 26
pratyay 26
prashanth 27
terry 27
pradeep 27
Aditya 28

--
Kannan
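The "two sorted parts" explanation can be checked against the sample data with a small pure-Python simulation. This is only a sketch under assumptions, not Spark code: split_evenly and sort_by_age are hypothetical helpers, and splitting the 8 rows into two contiguous halves is an assumed stand-in for how the input file is actually split (real splits are computed by bytes, not rows).

```python
# Rows of testSortBy in load order, as (name, age), taken from the thread.
ROWS = [
    ("Aditya", 28), ("aash", 25), ("prashanth", 27), ("bharath", 26),
    ("terry", 27), ("nanda", 26), ("pradeep", 27), ("pratyay", 26),
]


def split_evenly(rows, num_partitions):
    """Assumed splitting model: contiguous, near-equal chunks of rows."""
    size, extra = divmod(len(rows), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(rows[start:end])
        start = end
    return parts


def sort_by_age(rows, num_partitions):
    """Sort each partition by age on its own, then concatenate in partition order."""
    result = []
    for part in split_evenly(rows, num_partitions):
        # sorted() is stable, so rows with equal ages keep their input order.
        result.extend(sorted(part, key=lambda row: row[1]))
    return result


two_partitions = sort_by_age(ROWS, 2)  # models mapred.map.tasks = 2
one_partition = sort_by_age(ROWS, 1)   # models a single task or reducer

for name, age in two_partitions:
    print(name, age)
```

Under these assumptions, two partitions give the order reported for Spark SQL and one partition gives the order reported for Hive. The agreement on rows with equal ages relies on Python's stable sort; neither engine is claimed here to guarantee that tie order.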