Thanks Cheng for replying.
I meant setting the number of partitions for a cached table; it doesn’t need
to be re-adjusted after caching.
To provide more context:
What I am seeing on my dataset is a large number of tasks. Since each task
appears to map to a partition, I want to see whether matching the partition
count to the available core count will make it faster.
I’ll give your suggestion a try to see if it helps. Experimenting is a great
way to learn more about Spark internals.
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Monday, March 16, 2015 5:41 AM
To: Judy Nash; user@spark.apache.org
Subject: Re: configure number of cached partition in memory on SparkSQL
Hi Judy,
In the case of HadoopRDD and NewHadoopRDD, the partition number is actually
decided by the InputFormat used. And spark.sql.inMemoryColumnarStorage.batchSize
is not related to the partition number; it controls the in-memory columnar
batch size within a single partition.
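To illustrate the distinction, a small sketch (the value shown is just the default):

```sql
-- Controls how many rows go into each in-memory columnar batch
-- *within* a cached partition; it does not change the partition count.
SET spark.sql.inMemoryColumnarStorage.batchSize = 10000;
```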
Also, what do you mean by “change the number of partitions after caching the
table”? Are you trying to re-cache an already cached table with a different
partition number?
Currently, I don’t see a super intuitive pure-SQL way to set the partition
number in this case. Maybe you can try this (assuming table t has a column s
that is expected to be sorted):
SET spark.sql.shuffle.partitions = 10;
CACHE TABLE cached_t AS SELECT * FROM t ORDER BY s;
In this way, we introduce a shuffle by sorting on a column, and adjust the
partition number at the same time. This might not be the best way out there,
but it’s the first one that jumped into my head.
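One caveat worth noting: spark.sql.shuffle.partitions affects every subsequent shuffle in the session, so you may want to restore it after caching (200 is the default):

```sql
SET spark.sql.shuffle.partitions = 10;
CACHE TABLE cached_t AS SELECT * FROM t ORDER BY s;
-- Restore the default so later queries aren't also limited to 10 partitions
SET spark.sql.shuffle.partitions = 200;
```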
Cheng
On 3/5/15 3:51 AM, Judy Nash wrote:
Hi,
I am tuning a hive dataset on Spark SQL deployed via thrift server.
How can I change the number of partitions created when caching a table on the
Thrift server?
I have tried the following but still getting the same number of partitions
after caching:
spark.default.parallelism
spark.sql.inMemoryColumnarStorage.batchSize
Thanks,
Judy