Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Xuelin Cao
Hi Cheng, I checked the input data for each stage. For example, in my attached screen snapshot, the input data is 1212.5 MB, which is the total size of the whole table. And I also checked the input data for each task (in the stage detail page), and the sum of

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
Hey Xuelin, which data item in the Web UI did you check? On 1/7/15 5:37 PM, Xuelin Cao wrote: Hi, I'm puzzled by the Spark SQL cached table. In theory, the cached table should be a columnar table, and a query should scan only the columns referenced in my SQL. However, in my

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
Weird, which version did you use? I just tried a small snippet in the Spark 1.2.0 shell as follows, and the result shown in the web UI meets the expectation quite well:

import org.apache.spark.sql.SQLContext
import sc._
val sqlContext = new SQLContext(sc)
import sqlContext._

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Xuelin Cao
Hi Cheng, in your code:

cacheTable("tbl")
sql("select * from tbl").collect()
sql("select name from tbl").collect()

When the first SQL statement runs, the whole table is not cached yet, so the *input data will be the original JSON file*. After it is cached, the JSON-format data is removed, so the
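The lazy-cache behavior Xuelin describes can be sketched with a toy model (Python; all names here are invented for illustration and this is not Spark's actual implementation): the first scan reads the source file and materializes the cache, and later scans never touch the source again.

```python
# Toy model of lazy cache materialization (invented names, not Spark code).
class LazilyCachedTable:
    def __init__(self, source_rows):
        self.source_rows = source_rows   # stands in for the original JSON file
        self.cache = None                # populated on first scan
        self.source_reads = 0            # how often the "JSON file" was read

    def scan(self):
        if self.cache is None:
            self.source_reads += 1               # first query reads the source...
            self.cache = list(self.source_rows)  # ...and fills the cache
        return self.cache                # later queries are served from the cache

tbl = LazilyCachedTable([{"name": "a"}, {"name": "b"}])
tbl.scan()   # first SELECT: input is the original file
tbl.scan()   # second SELECT: input is the cached data
print(tbl.source_reads)  # 1
```

This is why the input size of the first query reflects the raw JSON file rather than the columnar cache.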

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
Ah, my bad... You're absolutely right! I just checked how this number is computed. It turns out that once an RDD block is retrieved from the block manager, the size of the whole block is added to the input bytes. Spark SQL's in-memory columnar format stores all columns within a single partition into a
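The accounting Cheng describes can be modeled with a minimal sketch (Python; invented names, not real Spark internals): a cached block holds every column buffer of a partition, and the input-bytes metric is charged the whole block's size whenever the block is fetched, regardless of how few columns the query actually reads.

```python
# Toy model of input-bytes accounting for cached blocks (invented names,
# not Spark's BlockManager implementation).
class BlockManager:
    def __init__(self):
        self.input_bytes = 0

    def fetch(self, block, wanted_columns):
        # The full block size is added to the metric on retrieval,
        # even though the query reads only some of its columns.
        self.input_bytes += sum(len(buf) for buf in block.values())
        return {name: block[name] for name in wanted_columns}

# One block = all column buffers of one partition (buffer sizes in bytes).
block = {"name": bytes(100), "age": bytes(100)}
bm = BlockManager()
bm.fetch(block, ["age"])      # SELECT age: the query touches one 100-byte buffer
print(bm.input_bytes)         # 200
```

Under this accounting, a single-column query over a fully cached table still reports the whole table's size as input, which matches what Xuelin observed in the Web UI.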

Spark SQL: The cached columnar table is not columnar?

2015-01-07 Thread Xuelin Cao
Hi, I'm puzzled by the Spark SQL cached table. In theory, the cached table should be a columnar table, and a query should scan only the columns referenced in my SQL. However, in my test, I always see the whole table scanned even though I only select one column in

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-07 Thread Michael Armbrust
The cache command caches the entire table, with each column stored in its own byte buffer. When querying the data, only the columns that you ask for are scanned in memory. I'm not sure what mechanism Spark is using to report the amount of data read. If you want to read only the data that
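The layout Michael describes can be sketched as a toy columnar table (Python; invented names, not Spark's actual in-memory columnar scan): each column sits in its own buffer, and a projection walks only the buffers it names.

```python
# Toy columnar cache (invented names): one buffer per column; a query
# scans only the buffers listed in its projection.
class ColumnarTable:
    def __init__(self, columns):
        self.columns = columns        # column name -> buffer (list of values)
        self.buffers_scanned = 0

    def select(self, projection):
        result = {}
        for name in projection:       # buffers not in the projection are never read
            self.buffers_scanned += 1
            result[name] = self.columns[name]
        return result

tbl = ColumnarTable({"id": [1, 2, 3], "age": [21, 35, 42]})
tbl.select(["age"])           # SELECT age FROM tbl
print(tbl.buffers_scanned)    # 1 -- the "id" buffer was never scanned
```

So the in-memory scan really is columnar; the confusion in this thread comes from how the Web UI's input metric is computed, not from how the data is read.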

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-07 Thread 曹雪林 (Xuelin Cao)
Thanks, Michael. 2015-01-08 6:04 GMT+08:00 Michael Armbrust mich...@databricks.com: