Ah, my bad... You're absolutely right!
Just checked how this number is computed. It turned out that once an RDD
block is retrieved from the block manager, the size of the block is
added to the input bytes. Spark SQL's in-memory columnar format stores
all columns within a single partition into a
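To make that concrete, here is a small check of what actually sits in the cache, assuming the Spark 1.2-era SparkContext.getRDDStorageInfo API (table setup as in the snippets further down). Each cached partition of the in-memory columnar table is stored as one block, and that block's size is what gets added to the stage's input bytes when it is read back:

// Rough sketch (Spark 1.2 shell; getRDDStorageInfo as assumed above)
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions} cached partitions, " +
          s"${info.memSize} bytes in memory")
}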
Hi, Cheng
In your code:
cacheTable("tbl")
sql("select * from tbl").collect()
sql("select name from tbl").collect()
When running the first SQL statement, the whole table is not cached yet, so the
*input data will be the original JSON file*.
After it is cached, the JSON format data is removed, s
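For reference, the lazy caching sequence described above looks like this in the shell (a minimal sketch; it assumes the SQLContext members are imported as in the snippet in the next message, and that the table was registered from the JSON file with a name column):

// Sketch of the sequence, with the described behavior as comments
cacheTable("tbl")                       // only marks the table for caching; nothing is read yet
sql("select * from tbl").collect()      // first scan reads the original JSON file and
                                        // builds the in-memory columnar data as a side effect
sql("select name from tbl").collect()   // later scans read from the columnar cache instead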
Weird, which version did you use? Just tried a small snippet in Spark
1.2.0 shell as follows, the result showed in the web UI meets the
expectation quite well:
import org.apache.spark.sql.SQLContext
import sc._
val sqlContext = new SQLContext(sc)
import sqlContext._
jsonFile("file:///
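A plausible continuation of the snippet under the same 1.2 API would be the following (the file path and column name are hypothetical):

val people = jsonFile("file:///tmp/people.json")   // hypothetical path
people.registerTempTable("tbl")
cacheTable("tbl")
sql("select * from tbl").collect()      // materializes the cache
sql("select name from tbl").collect()   // single-column scan whose Input metric is being compared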
Hi, Cheng
I checked the Input data for each stage. For example, in my attached
screenshot, the input data is 1212.5 MB, which is the total size of
the whole table.
[image: Inline image 1]
And I also checked the input data for each task (on the stage detail
page). And the sum of the
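If it helps to cross-check the UI numbers, the per-task input bytes can also be summed with a listener; this is a rough sketch against the 1.2 listener API, accumulating the same metric the stage detail page shows per task:

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Rough sketch: accumulate the input bytes reported by every finished task
val totalInput = new AtomicLong(0L)
sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    for (m <- Option(taskEnd.taskMetrics); in <- m.inputMetrics) {
      totalInput.addAndGet(in.bytesRead)
    }
  }
})
// run the query, then inspect totalInput.get()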
Hey Xuelin, which data item in the Web UI did you check?
On 1/7/15 5:37 PM, Xuelin Cao wrote:
Hi,
Curious and curious. I'm puzzled by the Spark SQL cached table.
Theoretically, the cached table should be a columnar table, and only
the columns included in my SQL should be scanned.
However, in my test,
Thanks Michael.
2015-01-08 6:04 GMT+08:00 Michael Armbrust:
> The cache command caches the entire table, with each column stored in its
> own byte buffer. When querying the data, only the columns that you are
> asking for are scanned in memory. I'm not sure what mechanism spark is
> using to r
The cache command caches the entire table, with each column stored in its
own byte buffer. When querying the data, only the columns that you are
asking for are scanned in memory. I'm not sure what mechanism Spark is
using to report the amount of data read.
If you want to read only the data that
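One way to see the column pruning described above, assuming the 1.2-era SchemaRDD API, is to print the physical plan: the in-memory scan node should carry only the requested column.

// Sketch: "tbl" is assumed to be registered and cached as elsewhere in the thread
val pruned = sql("select name from tbl")
println(pruned.queryExecution.executedPlan)   // expect an in-memory columnar scan over just `name`
pruned.collect()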
Hi,
Curious and curious. I'm puzzled by the Spark SQL cached table.
Theoretically, the cached table should be a columnar table, and only the
columns included in my SQL should be scanned.
However, in my test, I always see that the whole table is scanned even though
I only "select" one column i