Thanks, Cheng, for replying.

I meant changing the number of partitions of a cached table; it doesn’t need
to be re-adjusted after caching.

To provide more context:
What I am seeing on my dataset is that we have a large number of tasks. Since
each task appears to map to one partition, I want to see if matching the
partition count to the available core count will make it faster.
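For example (just a sketch, assuming a hypothetical 16-core cluster and that
the partition count comes from a shuffle), something like:

SET spark.sql.shuffle.partitions = 16;

should give one task per core per shuffle stage.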

I’ll give your suggestion a try to see if it helps. Experimenting is a great
way to learn more about Spark internals.

From: Cheng Lian [mailto:lian.cs....@gmail.com]
Sent: Monday, March 16, 2015 5:41 AM
To: Judy Nash; user@spark.apache.org
Subject: Re: configure number of cached partition in memory on SparkSQL


Hi Judy,

In the case of HadoopRDD and NewHadoopRDD, the partition number is actually
decided by the InputFormat used. And spark.sql.inMemoryColumnarStorage.batchSize
is not related to the partition number; it controls the in-memory columnar
batch size within a single partition.
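To illustrate (hypothetical numbers): with

SET spark.sql.inMemoryColumnarStorage.batchSize = 10000;

a cached partition holding 1,000,000 rows is stored as roughly 100 columnar
batches of 10,000 rows each, but the table still has exactly as many
partitions as before.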

Also, what do you mean by “change the number of partitions after caching the 
table”? Are you trying to re-cache an already cached table with a different 
partition number?

Currently, I don’t see a super intuitive pure SQL way to set the partition 
number in this case. Maybe you can try this (assuming table t has a column s 
which is expected to be sorted):

SET spark.sql.shuffle.partitions = 10;

CACHE TABLE cached_t AS SELECT * FROM t ORDER BY s;

In this way, we introduce a shuffle by sorting on a column, and adjust the
partition number at the same time. This might not be the best way out there,
but it’s the first one that popped into my head.
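
If the sort itself is too expensive, a variant that should also work (untested
on my side; it assumes the HiveQL DISTRIBUTE BY clause, which the Thrift
server’s HiveContext supports) is to shuffle by hashing on the column instead
of fully sorting it:

SET spark.sql.shuffle.partitions = 10;

CACHE TABLE cached_t AS SELECT * FROM t DISTRIBUTE BY s;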

Cheng

On 3/5/15 3:51 AM, Judy Nash wrote:
Hi,

I am tuning a Hive dataset on Spark SQL deployed via the Thrift server.

How can I change the number of partitions created when caching a table on the
Thrift server?

I have tried the following but am still getting the same number of partitions
after caching:
spark.default.parallelism
spark.sql.inMemoryColumnarStorage.batchSize


Thanks,
Judy
