Thanks Cheng for replying.
I meant setting the number of partitions for a cached table; it doesn’t need
to be re-adjusted after caching.
To provide more context:
What I am seeing on my dataset is a large number of tasks. Since each task
appears to map to a partition, I want to see whether matching the partition
count to the available core count will make it faster.
I’ll give your suggestion a try to see if it helps. Experimenting is a great
way to learn more about Spark internals.
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Monday, March 16, 2015 5:41 AM
To: Judy Nash; user@spark.apache.org
Subject: Re: configure number of cached partition in memory on SparkSQL
Hi Judy,
In the case of HadoopRDD and NewHadoopRDD, the partition number is actually
decided by the InputFormat used. And spark.sql.inMemoryColumnarStorage.batchSize
is not related to the partition number; it controls the in-memory columnar
batch size within a single partition.
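To illustrate the distinction, a small sketch (the value shown is just the default):

```sql
-- Controls how many rows go into each in-memory columnar batch
-- *within* a cached partition; it does not change the partition count.
SET spark.sql.inMemoryColumnarStorage.batchSize = 10000;
```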
Also, what do you mean by “change the number of partitions after caching the
table”? Are you trying to re-cache an already cached table with a different
partition number?
Currently, I don’t see a super intuitive pure-SQL way to set the partition
number in this case. Maybe you can try this (assuming table t has a column s
that is expected to be sorted):
SET spark.sql.shuffle.partitions = 10;
CACHE TABLE cached_t AS SELECT * FROM t ORDER BY s;
In this way, we introduce a shuffle by sorting on a column, and adjust the
partition number at the same time. This might not be the best way out there,
but it’s the first one that jumped into my head.
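One caveat worth noting: spark.sql.shuffle.partitions affects every subsequent shuffle in the session, so you may want to restore it after caching (200 is the default):

```sql
SET spark.sql.shuffle.partitions = 10;
CACHE TABLE cached_t AS SELECT * FROM t ORDER BY s;
-- Restore the default so later queries aren't also limited to 10 partitions
SET spark.sql.shuffle.partitions = 200;
```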
Cheng
On 3/5/15 3:51 AM, Judy Nash wrote:
Hi,
I am tuning a hive dataset on Spark SQL deployed via thrift server.
How can I change the number of partitions created when caching a table on the
Thrift server?
I have tried the following but still getting the same number of partitions
after caching:
spark.default.parallelism
spark.sql.inMemoryColumnarStorage.batchSize
Thanks,
Judy