Re: SQLCtx cacheTable

2014-08-04 Thread Gurvinder Singh
On 08/04/2014 10:57 PM, Michael Armbrust wrote:
> If Mesos is allocating a container that is exactly the same size as the max
> heap, then that leaves no buffer space for non-heap JVM memory,
> which seems wrong to me.
> 
That could be the cause. I am now wondering how Mesos picks the container
size and sets the -Xmx parameter.
> The problem here is that cacheTable is more aggressive about grabbing
> large ByteBuffers during caching (which it later releases when it knows
> the exact size of the data). There is a discussion here about trying to
> improve this: https://issues.apache.org/jira/browse/SPARK-2650
> 
I am not sure this issue is the one causing the problem for us: we have
roughly 60 GB of cached data, whereas each executor has 17 GB of memory and
there are 15 of them, i.e. 255 GB in total, which is far more than the 60 GB
of cached data.

Any suggestions on where to look for changing the Mesos settings in this
case?

- Gurvinder

Re: SQLCtx cacheTable

2014-08-04 Thread Michael Armbrust
If Mesos is allocating a container that is exactly the same size as the max
heap, then that leaves no buffer space for non-heap JVM memory, which seems
wrong to me.

The problem here is that cacheTable is more aggressive about grabbing large
ByteBuffers during caching (which it later releases when it knows the exact
size of the data). There is a discussion here about trying to improve this:
https://issues.apache.org/jira/browse/SPARK-2650
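
(Illustrative sketch only, not Spark's actual code: the overshoot in
SPARK-2650 is roughly of this shape -- a builder grabs a generously sized
buffer up front, fills it, and then copies into an exactly sized one, so for
a moment both buffers are live on the heap:)

import java.nio.ByteBuffer

val guess = ByteBuffer.allocate(64 * 1024 * 1024) // generous up-front grab
// ... column values get appended into `guess` here ...
guess.flip()                                      // prepare for reading
val exact = ByteBuffer.allocate(guess.limit())    // right-sized buffer
exact.put(guess)                                  // both buffers live until this copy ends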



Re: SQLCtx cacheTable

2014-08-03 Thread Gurvinder Singh
On 08/03/2014 02:33 AM, Michael Armbrust wrote:
> I am not a Mesos expert... but it sounds like there is some mismatch
> between the size that Mesos is giving you and the maximum heap size of
> the executors (-Xmx).
> 
It seems that Mesos is giving the correct size to the Java process; it has
the exact size set in the -Xms/-Xmx params. Do you know if I can somehow
find out which class or thread inside the Spark JVM process is using how
much memory, to see what makes it hit the memory limit in the cacheTable
case but not in the cached-RDD case?
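
(For reference, the stock JDK tools are one way to get at this, run on the
slave against the executor's JVM pid:)

jmap -histo:live <executor-pid> | head -30                # biggest classes on the heap
jmap -dump:live,format=b,file=exec.hprof <executor-pid>   # heap dump for jvisualvm / MAT
jstack <executor-pid>                                     # per-thread stacks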

- Gurvinder



Re: SQLCtx cacheTable

2014-08-02 Thread Michael Armbrust
I am not a Mesos expert... but it sounds like there is some mismatch
between the size that Mesos is giving you and the maximum heap size of the
executors (-Xmx).
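
(For what it's worth, a minimal sketch of the knobs involved on the Spark
side, assuming a Spark 1.0-era setup -- the executor heap (-Xmx) comes from
spark.executor.memory, and the container Mesos allocates should leave
headroom above it; later releases also grew a
spark.mesos.executor.memoryOverhead setting worth checking for in your
version:)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cacheTable-memory-test")
  .set("spark.executor.memory", "17g")        // becomes the executor's -Xmx
  .set("spark.storage.memoryFraction", "0.6") // share of heap for the block store
val sc = new SparkContext(conf)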


On Fri, Aug 1, 2014 at 12:07 AM, Gurvinder Singh  wrote:

> It is not getting an out-of-memory exception. I am using Mesos as the
> cluster manager, and when I use cacheTable it says that the container has
> used all of its allocated memory and kills it. I can see this in the logs
> on the mesos-slave where the executor runs. But the Spark application's
> web UI shows that it still has 4-5 GB of space left for caching/storing.
> So I am wondering how the memory is handled in the cacheTable case: does
> it reserve the memory for storage while other parts run out of their
> share? I also tried changing "spark.storage.memoryFraction", but that
> did not help.
>
> - Gurvinder
> On 08/01/2014 08:42 AM, Michael Armbrust wrote:
> > Are you getting OutOfMemoryExceptions with cacheTable? Or what do you
> > mean when you say you have to specify larger executor memory? You might
> > be running into SPARK-2650
> > (https://issues.apache.org/jira/browse/SPARK-2650).
> >
> > Is there something else you are trying to accomplish by setting the
> > persistence level? If you are looking for something like DISK_ONLY, you
> > can simulate that now using saveAsParquetFile and parquetFile.
> >
> > It is possible that in the future we will automatically map the standard
> > RDD persistence levels to these more efficient implementations.
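
(For reference, a minimal sketch of the saveAsParquetFile / parquetFile
route suggested above, against the Spark 1.0-era API; the path and table
names here are made up:)

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // sc: an existing SparkContext
import sqlContext._

// Write the table out once, simulating DISK_ONLY persistence...
sql("SELECT * FROM my_table").saveAsParquetFile("hdfs:///tmp/my_table.parquet")

// ...then register the on-disk copy and query it like any other table.
sqlContext.parquetFile("hdfs:///tmp/my_table.parquet").registerAsTable("my_table_disk")
sql("SELECT COUNT(*) FROM my_table_disk").collect()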


Re: SQLCtx cacheTable

2014-07-31 Thread Gurvinder Singh
Thanks, Michael, for the explanation. Actually, I tried caching the RDD and
making a table on it, but the performance with cacheTable was 3X better
than caching the RDD. Now I know why it is better. But would it be possible
to add support for a persistence level to cacheTable itself, as RDDs have?
Maybe it is unrelated, but on the same data set, when I use cacheTable I
have to give the executors more memory than I need when caching the RDD,
although in the storage tab of the status web UI the memory footprint is
almost the same: 58.3 GB with cacheTable and 59.7 GB with the cached RDD.
Is it possible that there is some memory leak, or does cacheTable work
differently and thus require more memory? The difference is 5 GB per
executor for a dataset of 122 GB.

Thanks,
Gurvinder



Re: SQLCtx cacheTable

2014-07-31 Thread Michael Armbrust
cacheTable uses a special columnar caching technique that is optimized for
SchemaRDDs. It is something similar to MEMORY_ONLY_SER, but not quite. You
can specify the persistence level on the SchemaRDD itself and register that
as a temporary table; however, it is likely you will not get as good
performance.
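
(In code, the two options look roughly like this -- a sketch against the
Spark 1.0-era API, with a made-up Person schema:)

import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
import sqlContext.createSchemaRDD    // implicitly turns RDD[Person] into a SchemaRDD

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

// Option 1: the optimized columnar cache -- no storage-level knob.
people.registerAsTable("people")
sqlContext.cacheTable("people")

// Option 2: pick the storage level on the (Schema)RDD yourself and register
// that; it works, but skips the columnar format, so it is usually slower.
people.persist(StorageLevel.MEMORY_ONLY_SER).registerAsTable("people_ser")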




SQLCtx cacheTable

2014-07-31 Thread Gurvinder Singh
Hi,

I am wondering how I can specify the persistence level in cacheTable, as it
takes only the table name as a parameter. It should be possible to specify
the persistence level.

- Gurvinder