Hi Judy,
In the case of |HadoopRDD| and |NewHadoopRDD|, the partition number is
actually decided by the |InputFormat| used. And
|spark.sql.inMemoryColumnarStorage.batchSize| is not related to the
partition number; it controls the in-memory columnar batch size within a
single partition.
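For example (just an illustration; 10000 is the default value), this setting only changes how many rows are packed into each columnar batch, and leaves the partition count untouched:

```sql
-- Controls rows per in-memory columnar batch within a partition,
-- NOT the number of partitions (default is 10000):
SET spark.sql.inMemoryColumnarStorage.batchSize = 20000;
```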
Also, what do you mean by “change the number of partitions /after/
caching the table”? Are you trying to re-cache an already cached table
with a different partition number?
Currently, I don’t see a super intuitive pure SQL way to set the
partition number in this case. Maybe you can try this (assuming table
|t| has a column |s| which is expected to be sorted):
|SET spark.sql.shuffle.partitions = 10;
CACHE TABLE cached_t AS SELECT * FROM t ORDER BY s;
|
In this way, we introduce a shuffle by sorting on a column, and scale
the partition number up or down at the same time. This might not be the
best way out there, but it’s the first one that jumped into my head.
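As a sketch of an alternative that also introduces a shuffle but doesn’t require a sortable column, you could try |DISTRIBUTE BY| instead of |ORDER BY| (here |k| is a placeholder for whatever column you want to hash-partition on; whether this suits your schema is an assumption on my part):

```sql
SET spark.sql.shuffle.partitions = 10;
CACHE TABLE cached_t AS SELECT * FROM t DISTRIBUTE BY k;
```

The hash repartition avoids the cost of a full sort while still letting |spark.sql.shuffle.partitions| take effect.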
Cheng
On 3/5/15 3:51 AM, Judy Nash wrote:
Hi,
I am tuning a hive dataset on Spark SQL deployed via thrift server.
How can I change the number of partitions after caching the table on
thrift server?
I have tried the following but still getting the same number of
partitions after caching:
spark.default.parallelism
spark.sql.inMemoryColumnarStorage.batchSize
Thanks,
Judy