Hi Judy,

In the case of HadoopRDD and NewHadoopRDD, the partition count is actually decided by the InputFormat in use. And spark.sql.inMemoryColumnarStorage.batchSize is not related to the partition count; it controls the in-memory columnar batch size within a single partition.
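To illustrate the distinction, here is a minimal sketch of how the two settings differ (the batch size value is an arbitrary example, not a recommendation):

```sql
-- Controls how many rows go into one in-memory columnar batch
-- *within* each cached partition; it does NOT change partitioning.
SET spark.sql.inMemoryColumnarStorage.batchSize = 10000;

-- Controls how many partitions a *shuffle* produces; it only takes
-- effect when the query plan actually contains a shuffle.
SET spark.sql.shuffle.partitions = 10;
```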

Also, what do you mean by “change the number of partitions /after/ caching the table”? Are you trying to re-cache an already cached table with a different partition number?

Currently, I don’t see a super intuitive pure SQL way to set the partition number in this case. Maybe you can try this (assuming table t has a column s which is expected to be sorted):

SET spark.sql.shuffle.partitions = 10;
CACHE TABLE cached_t AS SELECT * FROM t ORDER BY s;

In this way, we introduce a shuffle by sorting on a column, and adjust the partition count at the same time. This might not be the best way out there, but it’s the first one that jumped into my head.
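If you don’t actually need the output sorted, a cheaper way to force a shuffle is DISTRIBUTE BY, which repartitions by the given column without a full sort (table t and column s here are just the same placeholder names as above):

```sql
SET spark.sql.shuffle.partitions = 10;
-- DISTRIBUTE BY introduces a hash repartition (a shuffle) but skips
-- the sort, so the cached table still ends up with 10 partitions.
CACHE TABLE cached_t AS SELECT * FROM t DISTRIBUTE BY s;
```

Note that rows with the same value of s land in the same partition, so a heavily skewed column can produce unbalanced partitions.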

Cheng

On 3/5/15 3:51 AM, Judy Nash wrote:

Hi,

I am tuning a hive dataset on Spark SQL deployed via thrift server.

How can I change the number of partitions after caching the table on thrift server?

I have tried the following but still getting the same number of partitions after caching:

spark.default.parallelism

spark.sql.inMemoryColumnarStorage.batchSize

Thanks,

Judy
