Re: Spark SQL: The cached columnar table is not columnar?

Cheng Lian Thu, 08 Jan 2015 02:41:51 -0800

Hey Xuelin, which data item in the Web UI did you check?


On 1/7/15 5:37 PM, Xuelin Cao wrote:

Hi,

Curious and curious. I'm puzzled by the Spark SQL cached table.
Theoretically, the cached table should be a columnar table, and a query should scan only the columns included in my SQL.
However, in my test, I always see the whole table being scanned even though I only "select" one column in my SQL.
Here is my code:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable")
sqlContext.cacheTable("adTable")  // The table has > 10 columns

// First run, cache the table into memory
sqlContext.sql("select * from adTable").collect

// Second run, only one column is used. It should only scan a small fraction of data
sqlContext.sql("select adId from adTable").collect
sqlContext.sql("select adId from adTable").collect
sqlContext.sql("select adId from adTable").collect

What I found is that every time I run the SQL, the Web UI shows the same total amount of input data: the size of the whole table.
Is anything wrong? My expectation is:
1. The cached table is stored as a columnar table
2. Since I only need one column in my SQL, the total amount of input data shown in the Web UI should be very small
But what I found is not the case at all. Why?

Thanks
