[ https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15045014#comment-15045014 ]
Tao Li commented on SPARK-12179:
--------------------------------

[~sowen] My SQL and command-line parameters are as follows:

DATE=$1
starttime=$DATE"00"
endtime=$DATE"23"

sql="
select $DATE as date, 'main' as type, host, rfhost, rfpv
from (
  select row_number(t1.host) r, host, rfhost, rfpv
  from (
    select delhost(t0.host) as host, delhost(t0.rfhost) as rfhost, sum(t0.rfpv) as rfpv
    from (
      select h.host as host, i.rfhost as rfhost, i.rfpv as rfpv
      from (
        select parse_url(ur,'HOST') as host, count(1) as pv
        from mytable
        where logdate>=$starttime and logdate<=$endtime
        group by parse_url(ur,'HOST')
        order by pv desc
        limit 10000
      ) h
      left outer join (
        select parse_url(ur,'HOST') as host, parse_url(rf,'HOST') as rfhost, count(*) as rfpv
        from mytable
        where logdate>=$starttime and logdate<=$endtime
        group by parse_url(ur,'HOST'), parse_url(rf,'HOST')
      ) i
      on h.host = i.host
    ) t0
    group by delhost(t0.host), delhost(t0.rfhost)
    distribute by host
    sort by host, rfpv desc
  ) t1
) t2
where r<=10
"

/opt/spark/bin/spark-sql \
  --master yarn-client \
  --executor-memory 5G --num-executors 70 --executor-cores 1 \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.executor.extraJavaOptions="-XX:MaxPermSize=256m -XX:+CMSClassUnloadingEnabled -XX:MaxDirectMemorySize=1536m -XX:MaxTenuringThreshold=1 -Xmn100m -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC -XX:+PrintGCApplicationConcurrentTime -Xloggc:gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=10 -XX:+UseCompressedOops" \
  --driver-memory 3G --conf spark.driver.maxResultSize=2G \
  --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=256m -XX:+CMSClassUnloadingEnabled -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:gc.log -XX:+HeapDumpOnOutOfMemoryError" \
  --conf spark.yarn.am.memory=2G \
  --conf spark.yarn.am.extraJavaOptions="-XX:MaxPermSize=125m -XX:+CMSClassUnloadingEnabled" \
  --conf spark.sql.shuffle.partitions=2000 \
  --conf spark.executor.userClassPathFirst=true \
  -i init.hql -e "${sql}" -S > log.$DATE

> Spark SQL get different result with the same code
> -------------------------------------------------
>
> Key: SPARK-12179
> URL: https://issues.apache.org/jira/browse/SPARK-12179
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 1.5.2, 1.5.3
> Environment: hadoop version: 2.5.0-cdh5.3.2
>              spark version: 1.5.3
>              run mode: yarn-client
> Reporter: Tao Li
> Priority: Minor
>
> I run the SQL in yarn-client mode, but I get a different result each time.
> As you can see in the example, I get a different shuffle write with the same shuffle read in two jobs running the same code.
> Some of my Spark apps run well, but some always hit this problem, and I have seen it on Spark 1.3, 1.4 and 1.5.
> Can you give me some suggestions about the possible causes, or about how to figure out the problem?
>
> 1. First Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.8 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54934
> 2. Second Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.6 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54905
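
A side note on the ranking step in the query above: row_number(t1.host) looks like a custom Hive-style UDF (presumably registered via init.hql, like delhost), and under "distribute by host sort by host, rfpv desc" any ties in rfpv can legitimately arrive at the UDF in a different order on each run. If the intent is a per-host top 10 by rfpv, a minimal sketch of the standard window-function form would be the following. This is illustrative only, not the query actually run: "aggregated" is a placeholder for the t0/t1 aggregation above, and the rfhost tiebreaker is an assumption added purely to make the per-partition ordering total.

select $DATE as date, 'main' as type, host, rfhost, rfpv
from (
  select host, rfhost, rfpv,
         -- rank rows per host; the extra rfhost key breaks rfpv ties deterministically
         row_number() over (partition by host order by rfpv desc, rfhost) as r
  from aggregated
) t2
where r <= 10

With a total ordering inside each partition, the rows kept by r <= 10 no longer depend on arrival order, which is one way to rule the ranking step out as a source of the run-to-run differences.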