Interesting.

bq. details of execution for 10 and 100 scale factor input
Looks like some chart (or image) didn't go through. FYI

On Mon, Apr 25, 2016 at 12:50 PM, Ali Tootoonchian <a...@levyx.com> wrote:

> Caching the shuffle RDD before the sort process improves system
> performance. The SQL planner could be intelligent enough to cache the
> join, aggregate, or sort data frame before executing the next sort.
>
> For any sort, Spark creates two jobs: the first produces the range
> boundaries for the shuffle partitions, and the second completes the sort
> by creating a new shuffle RDD.
>
> When the input of a sort is the output of another shuffle, the reduce part
> of the shuffle RDD is re-evaluated and the intermediate shuffle data is
> read twice. If the input shuffle RDD (the exchange-based data frame) is
> saved, the sort can complete faster. Remember that Spark saves the RDD in
> Parquet format, which is usually compressed, so its size is smaller than
> the original data.
>
> Let's look at an example. The following query is a modified version of q3
> of the TPC-H benchmark:
>
> val tpchQuery =
>   """
>     |select *
>     |from
>     |  customer,
>     |  orders,
>     |  lineitem
>     |where
>     |  c_mktsegment = 'MACHINERY'
>     |  and c_custkey = o_custkey
>     |  and l_orderkey = o_orderkey
>     |  and o_orderdate < '1995-03-15'
>     |  and l_shipdate > '1995-03-15'
>     |order by
>     |  o_orderdate
>   """.stripMargin
>
> The query can be executed in one step using the current Spark SQL planner.
> The other approach is to execute it in two steps:
> 1. Compute and cache the output of the join.
> 2. Execute the order by.
>
> The following commands show how the second approach can be implemented:
>
> val tpchQuery =
>   """
>     |select *
>     |from
>     |  customer,
>     |  orders,
>     |  lineitem
>     |where
>     |  c_mktsegment = 'MACHINERY'
>     |  and c_custkey = o_custkey
>     |  and l_orderkey = o_orderkey
>     |  and o_orderdate < '1995-03-15'
>     |  and l_shipdate > '1995-03-15'
>   """.stripMargin
> val joinDf = sqlContext.sql(tpchQuery).cache
> val queryRes = joinDf.sort("o_orderdate")
>
> Let's look at the details of execution for scale factor 10 and 100 inputs.
>
> By comparing stages 4, 9, 10 and 15, 20, 21 of the two approaches, you can
> see that the amount of data read during the sort process can be reduced by
> a factor of 2.
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Cache-Shuffle-Based-Operation-Before-Sort-tp17331.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
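The double read the post describes can be sketched without Spark at all. The following plain-Scala sketch (all names hypothetical, no Spark dependency) mimics a lazy pipeline: the uncached plan re-reads its upstream shuffle output once for the range-boundary sampling job and again for the sort job, while a materialized copy is read only once.

```scala
// Plain-Scala sketch (no Spark; all names hypothetical) of the behavior
// described in the post: a lazy pipeline re-reads its upstream shuffle
// output once per downstream job, while a materialized copy is read once.

// Counts how many times the simulated shuffle output is read.
var shuffleReads = 0

// Stand-in for reading the intermediate shuffle data of the join.
def readShuffleOutput(): Seq[(String, Int)] = {
  shuffleReads += 1
  Seq(("1995-03-14", 3), ("1995-03-10", 1), ("1995-03-12", 2))
}

// Uncached plan: job 1 samples range boundaries, job 2 performs the sort;
// each job re-evaluates the upstream read, so the data is read twice.
def sortUncached(): Seq[(String, Int)] = {
  val boundaries = readShuffleOutput().map(_._1).take(1) // job 1: sampling
  readShuffleOutput().sortBy(_._1)                       // job 2: full sort
}

// Cached plan: materialize the join output once; both jobs reuse it.
def sortCached(): Seq[(String, Int)] = {
  val cached = readShuffleOutput() // like joinDf.cache plus a first action
  val boundaries = cached.map(_._1).take(1)              // job 1: sampling
  cached.sortBy(_._1)                                    // job 2: full sort
}
```

In real Spark the same effect comes from calling `.cache` on the exchange-based data frame before the sort, as in the quoted `joinDf` example; the sketch only makes the read-counting visible.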