Hm… Have you tuned |spark.storage.memoryFraction|? By default, 60% of
executor memory is used for caching. You can find the details here:
http://spark.apache.org/docs/latest/configuration.html
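For example, you can raise it when constructing the context. A minimal
sketch (Spark 1.x style configuration; the 0.8 value is just an
illustration, not a recommendation):

    import org.apache.spark.{SparkConf, SparkContext}

    // Reserve a larger fraction of executor memory for the block-manager
    // cache (the default is 0.6). The 0.8 below is illustrative only.
    val conf = new SparkConf()
      .setAppName("cache-tuning")
      .set("spark.storage.memoryFraction", "0.8")
    val sc = new SparkContext(conf)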
On 11/15/14 5:43 AM, Sadhan Sood wrote:
Thanks Cheng, that was helpful. I noticed from the UI that only half of
the memory per executor was being used for caching; is that true? We
have a 2 TB sequence file dataset that we wanted to cache in our
cluster, which has ~5 TB of memory, but caching still failed. From the
UI it looked like it used 2.5 TB of memory and wrote almost 12 TB to
disk (at which point it was useless) during the mapPartitions stage.
Also, we couldn't run more than 2 executors per box (60 GB of memory
per box); with more executors each one got less memory and died very
quickly (not sure why?), although I/O seemed to go much faster, which
makes sense because of the more parallel reads.
On Thu, Nov 13, 2014 at 10:50 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
No, the columnar buffer is built in small batches; the batch size is
controlled by the |spark.sql.inMemoryColumnarStorage.batchSize|
property. The default value for this in master and branch-1.2 is
10,000 rows per batch.
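If you want to tune it, a minimal sketch, assuming a Spark 1.x
SQLContext named |sqlContext|:

    // Rows are accumulated into columnar batches of this many rows, so
    // only one batch per partition is materialized at a time. 10000 is
    // the default; it is shown here just to make the knob explicit.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")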
On 11/14/14 1:27 AM, Sadhan Sood wrote:
Thanks Cheng. Just one more question: does that mean that we
still need enough memory in the cluster to uncompress the data
before it can be compressed again, or does it just read the raw
data as is?
On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
Currently there’s no way to cache the compressed sequence
file directly. Spark SQL uses an in-memory columnar format when
caching table rows, so we must read all the raw data and
convert it into the columnar format. However, you can enable
in-memory columnar compression by setting
|spark.sql.inMemoryColumnarStorage.compressed| to |true|.
This property is already set to |true| by default in the master
branch and branch-1.2.
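Putting it together, a minimal sketch, assuming an existing SQLContext
named |sqlContext| and a registered table "my_table" (a placeholder
name):

    // Enable compression of the in-memory columnar buffers, then build
    // the cache. The data is still decompressed from the sequence file
    // and re-encoded batch by batch into the columnar format.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
    sqlContext.cacheTable("my_table")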
On 11/13/14 7:16 AM, Sadhan Sood wrote:
We noticed, while caching data from our Hive tables that store
data in compressed sequence file format, that it gets
uncompressed in memory when being cached. Is there a way
to turn this off and cache the compressed data as is?
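For context, the caching step looks roughly like this (a sketch; the
table name is a placeholder, and we use a Spark 1.x HiveContext):

    import org.apache.spark.sql.hive.HiveContext

    // Assumes an existing SparkContext `sc`; "our_seqfile_table" is a
    // hypothetical Hive table backed by compressed sequence files.
    val hiveContext = new HiveContext(sc)
    hiveContext.cacheTable("our_seqfile_table")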