If Mesos is allocating a container that is exactly the same size as the max
heap, then that leaves no buffer space for non-heap JVM memory, which seems
wrong to me.
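
As a rough sketch of the arithmetic I have in mind (the numbers below are
made up, not taken from your setup):

    // Hypothetical sizing: the container should cover heap plus non-heap memory.
    val heapBytes = 8L * 1024 * 1024 * 1024        // what -Xmx is set to: 8 GB
    val overheadBytes = (heapBytes * 0.15).toLong  // assumed ~15% for permgen, thread stacks, direct buffers
    val containerBytes = heapBytes + overheadBytes // what the Mesos container should get

    // If containerBytes == heapBytes, any non-heap allocation pushes the
    // process over the limit and Mesos kills the container.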

The problem here is that cacheTable is more aggressive about grabbing large
ByteBuffers during caching (which it later releases when it knows the exact
size of the data). There is a discussion about trying to improve this here:
https://issues.apache.org/jira/browse/SPARK-2650
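
If your Spark build exposes the columnar batch size setting (I am assuming
it here; check your version), shrinking it should reduce the size of those
transient buffers:

    import org.apache.spark.SparkConf

    // Assumed property: smaller batches mean smaller pre-allocated ByteBuffers
    // during columnar caching. The default value differs across releases.
    val conf = new SparkConf()
      .set("spark.sql.inMemoryColumnarStorage.batchSize", "1000")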


On Sun, Aug 3, 2014 at 11:35 PM, Gurvinder Singh
<gurvinder.si...@uninett.no> wrote:

> On 08/03/2014 02:33 AM, Michael Armbrust wrote:
> > I am not a Mesos expert... but it sounds like there is some mismatch
> > between the size that Mesos is giving you and the maximum heap size of
> > the executors (-Xmx).
> >
> It seems that Mesos is giving the correct size to the Java process; it
> has the exact size set in the -Xms/-Xmx params. Do you know if I can
> somehow find out which class or thread inside the Spark JVM process is
> using how much memory, to see what makes it reach the memory limit in the
> cacheTable case but not in the cached-RDD case.
>
> - Gurvinder
> >
> > On Fri, Aug 1, 2014 at 12:07 AM, Gurvinder Singh
> > <gurvinder.si...@uninett.no> wrote:
> >
> >     It is not getting an out-of-memory exception. I am using Mesos as the
> >     cluster manager, and when I use cacheTable it reports that the
> >     container has used all of its allocated memory and thus kills it. I
> >     can see this in the logs on the mesos-slave where the executor runs.
> >     But the web UI of the Spark application shows that it still has 4-5 GB
> >     of space left for caching/storing. So I am wondering how the memory is
> >     handled in the cacheTable case. Does it reserve the storage memory
> >     while other parts run out of their share? I also tried changing
> >     "spark.storage.memoryFraction", but that did not help.
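> >
> >     For concreteness, this is roughly the kind of change I tried (the
> >     values here are placeholders):
> >
> >         import org.apache.spark.SparkConf
> >         val conf = new SparkConf()
> >           .set("spark.executor.memory", "8g")         // heap per executor (placeholder)
> >           .set("spark.storage.memoryFraction", "0.6") // fraction of heap for caching (0.6 is the default)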
> >
> >     - Gurvinder
> >     On 08/01/2014 08:42 AM, Michael Armbrust wrote:
> >     > Are you getting OutOfMemoryExceptions with cacheTable? Or what do
> >     > you mean when you say you have to specify larger executor memory?
> >     > You might be running into SPARK-2650
> >     > <https://issues.apache.org/jira/browse/SPARK-2650>.
> >     >
> >     > Is there something else you are trying to accomplish by setting the
> >     > persistence level? If you are looking for something like DISK_ONLY,
> >     > you can simulate that now using saveAsParquetFile and parquetFile.
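> >     >
> >     > For example, an untested sketch (the table name and path are
> >     > placeholders; adjust to your version's API):
> >     >
> >     >     // Write the data out as Parquet, then read it back and register
> >     >     // it as a table, approximating DISK_ONLY persistence.
> >     >     val data = sqlContext.sql("SELECT * FROM src")
> >     >     data.saveAsParquetFile("/tmp/src.parquet")
> >     >     val onDisk = sqlContext.parquetFile("/tmp/src.parquet")
> >     >     onDisk.registerAsTable("src_disk")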
> >     >
> >     > It is possible that in the long term we will automatically map the
> >     > standard RDD persistence levels to these more efficient
> >     > implementations.
> >     >
> >     >
> >     > On Thu, Jul 31, 2014 at 11:26 PM, Gurvinder Singh
> >     > <gurvinder.si...@uninett.no> wrote:
> >     >
> >     >     Thanks Michael for the explanation. Actually I tried caching the
> >     >     RDD and making a table on it, but the performance of cacheTable
> >     >     was 3x better than caching the RDD. Now I know why it is better.
> >     >     But is it possible to add support for persistence levels to
> >     >     cacheTable itself, like RDDs have? Maybe it is not related, but
> >     >     on the same size of data set, when I use cacheTable I have to
> >     >     specify larger executor memory than I need when caching the RDD.
> >     >     Although in the storage tab on the status web UI the memory
> >     >     footprint is almost the same, 58.3 GB with cacheTable and 59.7 GB
> >     >     with the cached RDD. Is it possible that there is some memory
> >     >     leak, or does cacheTable work differently and thus require more
> >     >     memory? The difference is 5 GB per executor for a dataset of
> >     >     122 GB.
> >     >
> >     >     Thanks,
> >     >     Gurvinder
> >     >     On 08/01/2014 04:42 AM, Michael Armbrust wrote:
> >     >     > cacheTable uses a special columnar caching technique that is
> >     >     > optimized for SchemaRDDs. It is something similar to
> >     >     > MEMORY_ONLY_SER, but not quite. You can specify the persistence
> >     >     > level on the SchemaRDD itself and register that as a temporary
> >     >     > table; however, it is likely you will not get as good
> >     >     > performance.
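> >     >     >
> >     >     > A minimal sketch of that alternative (the table names are
> >     >     > placeholders):
> >     >     >
> >     >     >     import org.apache.spark.storage.StorageLevel
> >     >     >
> >     >     >     // Persist the SchemaRDD at an explicit level and register
> >     >     >     // it, instead of calling cacheTable.
> >     >     >     val rdd = sqlContext.sql("SELECT * FROM src")
> >     >     >     rdd.persist(StorageLevel.MEMORY_ONLY_SER)
> >     >     >     rdd.registerAsTable("src_cached")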
> >     >     >
> >     >     >
> >     >     > On Thu, Jul 31, 2014 at 6:16 AM, Gurvinder Singh
> >     >     > <gurvinder.si...@uninett.no> wrote:
> >     >     >
> >     >     > Hi,
> >     >     >
> >     >     > I am wondering how I can specify the persistence level in
> >     >     > cacheTable, as it takes only the table name as a parameter. It
> >     >     > should be possible to specify the persistence level.
> >     >     >
> >     >     > - Gurvinder
> >     >     >
> >     >     >
> >     >
> >     >
> >
> >
>
>
