Could you check the Spark web UI for the number of tasks issued when the query is executed? I dug up |mapred.map.tasks| because I saw that 2 tasks were issued.

On 2/26/15 3:01 AM, Kannan Rajah wrote:

Cheng, We tried this setting and it still did not help. This was on Spark 1.2.0.


--
Kannan

On Mon, Feb 23, 2015 at 6:38 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

    (Move to user list.)

    Hi Kannan,

    You need to set |mapred.map.tasks| to 1 in hive-site.xml. The
    reason is this line of code
    <https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L68>,
    which overrides |spark.default.parallelism|. Also,
    |spark.sql.shuffle.partitions| isn’t used here since there’s no
    shuffle involved (we only need to sort within each partition).
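
    For reference, a minimal sketch of the hive-site.xml property in
    question (assuming a standard Hadoop-style configuration file; the
    rest of your hive-site.xml stays as-is):

```xml
<!-- hive-site.xml: force a single map task so "sort by" yields one
     globally sorted partition. Sketch only; merge into your existing
     <configuration> block. -->
<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>1</value>
  </property>
</configuration>
```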

    The default value of |mapred.map.tasks| is 2
    <https://hadoop.apache.org/docs/r1.0.4/mapred-default.html>. That is
    why the Spark SQL result you posted can be divided, at the middle,
    into two independently sorted parts.
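
    To illustrate (a plain-Python sketch, not Spark; the four-row split
    point is an assumption matching the two default map tasks), sorting
    each of the two input partitions independently reproduces the two
    sorted runs seen in the Spark SQL output below:

```python
# Sketch of what "sort by" does with mapred.map.tasks = 2: each
# partition is sorted on its own (no shuffle), so the concatenated
# output consists of two sorted runs rather than one sorted list.
rows = [
    ("Aditya", 28), ("aash", 25), ("prashanth", 27), ("bharath", 26),
    ("terry", 27), ("nanda", 26), ("pradeep", 27), ("pratyay", 26),
]

# Two map tasks -> two input splits of four rows each (assumed split).
partitions = [rows[:4], rows[4:]]

# "sort by age" sorts within each partition only.
result = [r for part in partitions for r in sorted(part, key=lambda r: r[1])]
print(result)
```

    The printed list matches the Spark SQL output in the quoted message:
    ages 25, 26, 27, 28 followed by 26, 26, 27, 27 — sorted within each
    half, but not globally.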

    Cheng

    On 2/19/15 10:33 AM, Kannan Rajah wrote:

    According to the Hive documentation, "sort by" is supposed to order the
    results within each reducer. So if we set a single reducer, the results
    should be fully sorted, right? But this is not happening. Any idea why?
    It looks like the settings I am using to restrict the number of reducers
    are not having any effect.

    *Tried the following:*

    Set spark.default.parallelism to 1

    Set spark.sql.shuffle.partitions to 1

    These were set in hive-site.xml and also inside spark shell.


    *Spark-SQL*

    create table if not exists testSortBy (key int, name string, age int);
    LOAD DATA LOCAL INPATH '/home/mapr/sample-name-age.txt' OVERWRITE INTO TABLE
    testSortBy;
    select * from testSortBy;

    1    Aditya    28
    2    aash    25
    3    prashanth    27
    4    bharath    26
    5    terry    27
    6    nanda    26
    7    pradeep    27
    8    pratyay    26


    set spark.default.parallelism=1;

    set spark.sql.shuffle.partitions=1;

    select name,age from testSortBy sort by age;

    aash    25
    bharath    26
    prashanth    27
    Aditya    28
    nanda    26
    pratyay    26
    terry    27
    pradeep    27

    *HIVE*

    select name,age from testSortBy sort by age;

    aash    25
    bharath    26
    nanda    26
    pratyay    26
    prashanth    27
    terry    27
    pradeep    27
    Aditya    28


    --
    Kannan
