Thanks Cheng, that was helpful. I noticed from the UI that only half of the
memory per executor was being used for caching. Is that true? We have a 2
TB sequence file dataset that we wanted to cache in our cluster with ~5 TB of
memory, but caching still failed, and what it looked like from the UI was that
it
Hm… Have you tuned |spark.storage.memoryFraction|? By default, 60% of
the executor heap is used for caching. For details, see
http://spark.apache.org/docs/latest/configuration.html
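
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the default |spark.storage.memoryFraction| of 0.6 mentioned above and, as a further assumption not stated in this thread, an additional safety factor of 0.9 applied on top of it. The helper name storage_memory is made up for illustration, and the 5 TB figure is treated as total executor heap, which may not match the actual cluster.

```python
# Rough estimate of how much heap is available for cached blocks.
# memory_fraction mirrors spark.storage.memoryFraction (default 0.6).
# safety_fraction (0.9) is an assumed extra margin, not taken from this thread.
def storage_memory(heap_bytes, memory_fraction=0.6, safety_fraction=0.9):
    """Approximate number of heap bytes usable for caching."""
    return heap_bytes * memory_fraction * safety_fraction

TB = 1024 ** 4
cluster_heap = 5 * TB  # assumption: all ~5 TB is executor heap
cache_capacity = storage_memory(cluster_heap)

print(f"{cache_capacity / TB:.2f} TB usable for caching")
```

Under these assumptions a little over half of the heap is usable for caching, which would be consistent with the UI showing roughly half of each executor's memory. Whether a 2 TB sequence file fits also depends on its size once decompressed and converted, which can be larger than its size on disk.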
On 11/15/14 5:43 AM, Sadhan Sood wrote:
Thanks Cheng, that was helpful. I noticed from UI that only half
Thanks Cheng, just one more question - does that mean that we still need
enough memory in the cluster to uncompress the data before it can be
compressed again or does that just read the raw data as is?
On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian lian.cs@gmail.com wrote:
Currently there’s
No, the columnar buffer is built in small batches; the batch
size is controlled by the |spark.sql.inMemoryColumnarStorage.batchSize|
property. The default value for this in master and branch-1.2 is 10,000
rows per batch.
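
As an illustration of why batching keeps memory bounded, here is a minimal Python sketch. It is not Spark's implementation: the function columnar_batches and the sample rows are hypothetical, and only the idea of a fixed batch size (10,000 rows by default, per the message above) comes from this thread.

```python
from itertools import islice

def columnar_batches(rows, batch_size=10000):
    """Yield column-oriented batches from an iterator of row tuples.

    Only batch_size rows are materialized at a time, so the whole
    dataset never has to sit in memory in row form.
    """
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        # Transpose the row tuples into one list per column.
        yield [list(column) for column in zip(*batch)]

# Hypothetical sample data: 25 rows of (id, name), read lazily.
rows = ((i, f"name{i}") for i in range(25))
batches = list(columnar_batches(rows, batch_size=10))

for columns in batches:
    print(len(columns[0]), "rows in this batch")
```

Each batch is converted on its own, so peak memory for the conversion step scales with the batch size rather than with the full dataset.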
On 11/14/14 1:27 AM, Sadhan Sood wrote:
Thanks Cheng, Just
Currently there’s no way to cache the compressed sequence file directly.
Spark SQL uses an in-memory columnar format while caching table rows, so we
must read all the raw data and convert it into columnar format.
However, you can enable in-memory columnar compression by setting