The cache command caches the entire table, with each column stored in its
own byte buffer.  When querying the data, only the columns that you are
asking for are scanned in memory.  I'm not sure what mechanism Spark is
using to report the amount of data read in the web UI.

If you want to read only the data you are looking for off of disk, I'd
suggest looking at Parquet; a rough sketch follows.
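Something along these lines should work (an untested sketch using the
Spark 1.2-era API; the Parquet path is a placeholder):

// Write the JSON data out as Parquet once, then query the Parquet copy.
sqlContext.jsonFile("/data/ad.json").saveAsParquetFile("/data/ad.parquet")

// Parquet is columnar on disk, so a query that only touches adId should
// read roughly that column's data (plus file metadata) rather than the
// whole table.
sqlContext.parquetFile("/data/ad.parquet").registerTempTable("adTableParquet")
sqlContext.sql("select adId from adTableParquet").collect()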

On Wed, Jan 7, 2015 at 1:37 AM, Xuelin Cao <xuelin...@yahoo.com.invalid>
wrote:

>
> Hi,
>
>       Curious and curious. I'm puzzled by the Spark SQL cached table.
>
>       Theoretically, the cached table should be a columnar table, and only
> the columns referenced in my SQL should be scanned.
>
>       However, in my test, I always see the whole table being scanned even
> though I only "select" one column in my SQL.
>
>       Here is my code:
>
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>
> import sqlContext._
>
> sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable")
> sqlContext.cacheTable("adTable")  // The table has > 10 columns
>
> // First run, cache the table into memory
> sqlContext.sql("select * from adTable").collect
>
> // Second run, only one column is used. It should only scan a small
> // fraction of data
> sqlContext.sql("select adId from adTable").collect
>
> sqlContext.sql("select adId from adTable").collect
> sqlContext.sql("select adId from adTable").collect
>
>         What I found is that every time I run the SQL, the Web UI shows
> that the amount of input data is always the same: the total size of the
> table.
>
>         Is anything wrong? My expectation is:
>         1. The cached table is stored as a columnar table
>         2. Since I only need one column in my SQL, the total amount of
> input data shown in the Web UI should be very small
>
>         But what I found is totally not the case. Why?
>
>         Thanks
>
>
