Hi, Cheng

In your code:
cacheTable("tbl") sql("select * from tbl").collect() sql("select name from tbl").collect() Running the first sql, the whole table is not cached yet. So the *input data will be the original json file. * After it is cached, the json format data is removed, so the total amount of data also drops. If you try like this: cacheTable("tbl") sql("select * from tbl").collect() sql("select name from tbl").collect() sql("select * from tbl").collect() Is the input data of the 3rd SQL bigger than 49.1KB? On Thu, Jan 8, 2015 at 9:36 PM, Cheng Lian <lian.cs....@gmail.com> wrote: > Weird, which version did you use? Just tried a small snippet in Spark > 1.2.0 shell as follows, the result showed in the web UI meets the > expectation quite well: > > import org.apache.spark.sql.SQLContextimport sc._ > val sqlContext = new SQLContext(sc)import sqlContext._ > > jsonFile("file:///tmp/p.json").registerTempTable("tbl") > cacheTable("tbl") > sql("select * from tbl").collect() > sql("select name from tbl").collect() > > The input data of the first statement is 292KB, the second is 49.1KB. > > The JSON file I used is examples/src/main/resources/people.json, I copied > its contents multiple times to generate a larger file. > > Cheng > > On 1/8/15 7:43 PM, Xuelin Cao wrote: > > > > Hi, Cheng > > I checked the Input data for each stage. For example, in my > attached screen snapshot, the input data is 1212.5MB, which is the total > amount of the whole table > > [image: Inline image 1] > > And, I also check the input data for each task (in the stage detail > page). And the sum of the input data for each task is also 1212.5MB > > > > > On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian <lian.cs....@gmail.com> wrote: > >> Hey Xuelin, which data item in the Web UI did you check? >> >> >> On 1/7/15 5:37 PM, Xuelin Cao wrote: >> >> >> Hi, >> >> Curious and curious. I'm puzzled by the Spark SQL cached table. >> >> Theoretically, the cached table should be columnar table, and >> only scan the column that included in my SQL. >> >> However, in my test, I always see the whole table is scanned even >> though I only "select" one column in my SQL. >> >> Here is my code: >> >> >> *val sqlContext = new org.apache.spark.sql.SQLContext(sc) * >> >> *import sqlContext._ * >> >> *sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable") * >> *sqlContext.cacheTable("adTable") //The table has > 10 columns* >> >> *//First run, cache the table into memory* >> *sqlContext.sql("select * from adTable").collect* >> >> *//Second run, only one column is used. It should only scan a small >> fraction of data* >> *sqlContext.sql("select adId from adTable").collect * >> >> *sqlContext.sql("select adId from adTable").collect * >> *sqlContext.sql("select adId from adTable").collect* >> >> What I found is, every time I run the SQL, in WEB UI, it shows >> the total amount of input data is always the same --- the total amount of >> the table. >> >> Is anything wrong? My expectation is: >> 1. The cached table is stored as columnar table >> 2. Since I only need one column in my SQL, the total amount of >> input data showed in WEB UI should be very small >> >> But what I found is totally not the case. Why? >> >> Thanks >> >> >> > >