[ https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048496#comment-15048496 ]
Tao Li commented on SPARK-12179:
--------------------------------

I tried using Spark's built-in row_number() window function, but the problem is still there:

  select $DATE as date, 'main' as type, host, rfhost, rfpv
  from (
      select row_number() over (partition by host order by host, rfpv desc) as r,
             host, rfhost, rfpv
      from (
          select delhost(t0.host) as host,
                 delhost(t0.rfhost) as rfhost,
                 sum(t0.rfpv) as rfpv
          from (
              select h.host as host, i.rfhost as rfhost, i.rfpv as rfpv
              from (
                  select parse_url(ur, 'HOST') as host, count(1) as pv
                  from custom.web_sogourank_orc_zlib
                  where logdate >= $starttime and logdate <= $endtime
                  group by parse_url(ur, 'HOST')
                  order by pv desc
                  limit 10000
              ) h
              left outer join (
                  select parse_url(ur, 'HOST') as host,
                         parse_url(rf, 'HOST') as rfhost,
                         count(*) as rfpv
                  from custom.web_sogourank_orc_zlib
                  where logdate >= $starttime and logdate <= $endtime
                  group by parse_url(ur, 'HOST'), parse_url(rf, 'HOST')
              ) i on h.host = i.host
          ) t0
          group by delhost(t0.host), delhost(t0.rfhost)
          distribute by host
          sort by host, rfpv desc
      ) t1
  ) t2
  where r <= 10

> Spark SQL get different result with the same code
> -------------------------------------------------
>
>                 Key: SPARK-12179
>                 URL: https://issues.apache.org/jira/browse/SPARK-12179
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 1.5.2, 1.5.3
>        Environment: hadoop version: 2.5.0-cdh5.3.2
>                     spark version: 1.5.3
>                     run mode: yarn-client
>            Reporter: Tao Li
>            Priority: Critical
>
> I run the SQL in yarn-client mode but get a different result each time.
> As the example below shows, two runs of the same code produce the same
> shuffle read but different shuffle writes.
> Some of my Spark apps run fine, but some always hit this problem, and I
> have seen it on Spark 1.3, 1.4, and 1.5.
> Can you suggest possible causes, or how I might track the problem down?
> 1. First Run
>    Details for Stage 9 (Attempt 0)
>    Total Time Across All Tasks: 5.8 min
>    Shuffle Read: 24.4 MB / 205399
>    Shuffle Write: 6.8 MB / 54934
>
> 2. Second Run
>    Details for Stage 9 (Attempt 0)
>    Total Time Across All Tasks: 5.6 min
>    Shuffle Read: 24.4 MB / 205399
>    Shuffle Write: 6.8 MB / 54905
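One possible cause, offered here as an assumption rather than a diagnosis confirmed in this thread: `order by host, rfpv desc` inside the window is not a total order, so rows tied on rfpv within a host can receive different row numbers depending on the order in which the shuffle happens to deliver them, and the `where r <= 10` filter then keeps a different set of rows (and hence a different shuffle-write size) on each run. A minimal Python sketch of that effect, with entirely made-up data:

```python
# Hypothetical sketch (not from the ticket): why a top-N filter on
# row_number() can differ between runs when the ORDER BY key has ties.
import random

# 20 referrers for one host, all tied on the same rfpv count
rows = [("a.com", "ref%02d" % i, 5) for i in range(20)]

def top_n(data, n, seed):
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # simulate a shuffle's arbitrary arrival order
    # Order only by (host, rfpv desc), as in the query above; Python's sort
    # is stable, so tied rows keep whatever order the shuffle gave them.
    ranked = sorted(shuffled, key=lambda r: (r[0], -r[2]))
    return {r[1] for r in ranked[:n]}

run1 = top_n(rows, 10, seed=1)  # "first run"
run2 = top_n(rows, 10, seed=2)  # "second run": a different shuffle order
# The two top-10 sets need not match, because ties are broken arbitrarily.
print(run1 == run2)

def top_n_total_order(data, n, seed):
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    # Adding the (unique) rfhost as a tiebreaker makes the ordering total,
    # so the same 10 rows survive r <= 10 regardless of shuffle order.
    ranked = sorted(shuffled, key=lambda r: (r[0], -r[2], r[1]))
    return {r[1] for r in ranked[:n]}

print(top_n_total_order(rows, 10, 1) == top_n_total_order(rows, 10, 2))  # True
```

Under this assumption, appending a unique tiebreaker column (e.g. rfhost) to the window's ORDER BY clause would make the ranking, and therefore the result set, deterministic across runs.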