Just a clarification: I am using Spark ALS with explicit feedback on a standalone cluster, without the ZooKeeper master HA option deployed yet.

When a worker in the standalone Spark cluster hits a GC error, the worker process dies as well and I have to restart it by hand. Understanding this issue will be useful as we deploy the solution.
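For context, the job is set up roughly like this. This is only a minimal sketch of the explicit-feedback path; the master URL, executor memory, input path, iteration count, and lambda below are placeholders rather than the exact values from my run:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Hypothetical master URL, memory setting, and input path.
    val conf = new SparkConf()
      .setAppName("ALSExplicit")
      .setMaster("spark://master-host:7077")
      .set("spark.executor.memory", "32g")
    val sc = new SparkContext(conf)

    // Ratings assumed to be "user,item,rating" lines.
    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
      val Array(u, p, r) = line.split(",")
      Rating(u.toInt, p.toInt, r.toDouble)
    }.cache()

    // Explicit-feedback ALS: rank 50 as discussed below; the iteration
    // count and lambda are illustrative, not tuned values.
    val model = ALS.train(ratings, 50, 10, 0.01)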
On Wed, Mar 26, 2014 at 7:31 AM, Debasish Das <debasish.da...@gmail.com> wrote:

> Thanks Sean. Looking into the executor memory options now.
>
> I am at incubator-spark HEAD. Does that have all the fixes, or do I need
> the spark HEAD? I can deploy the spark HEAD as well.
>
> I am not running implicit feedback yet. I remember the memory enhancements
> were mainly for implicit feedback, right?
>
> For ulimit, let me look into the CentOS settings. I am curious how
> MapReduce avoids this; by using one core per process? I am running 2 TB
> YARN jobs as well for ETL, pre-processing, etc., and have not seen the
> "too many open files" error there yet.
>
> When there is a GC error, the worker dies; that is a mystery as well. Any
> insights from the Spark core team? A YARN container gets killed if it is
> about to exceed its memory limits; could a similar idea be used here? Also,
> which tool do we use for memory debugging in Spark?
>
> On Mar 26, 2014 1:45 AM, "Sean Owen" <so...@cloudera.com> wrote:
>
>> Much of this sounds related to the memory issue mentioned earlier in this
>> thread. Are you using a build that has fixed that? That would be by far
>> the most important thing here.
>>
>> If the raw memory requirement is 8 GB, the actual heap size necessary
>> could be a lot larger -- object overhead, all the other stuff in memory,
>> overheads within the heap allocation, etc. So I would expect the total
>> memory requirement to be significantly more than 9 GB.
>>
>> Still, this is the *total* requirement across the cluster. Each worker is
>> just loading part of the matrix. If you have 10 workers, I would imagine
>> it roughly chops the per-worker memory requirement by 10x.
>>
>> This in turn depends on also letting workers use more than their default
>> amount of memory. You may need to increase executor memory here.
>>
>> Separately, I have observed issues with too many open files and lots of
>> /tmp files. You may have to use ulimit to increase the number of open
>> files allowed.
>>
>> On Wed, Mar 26, 2014 at 6:06 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > For our use cases we are looking into 20 x 1M matrices, which is in a
>> > similar range to what is outlined in the paper discussed here:
>> >
>> > http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html
>> >
>> > Does the exponential runtime growth in Spark ALS outlined by that blog
>> > post still exist in recommendation.ALS?
>> >
>> > I am running a Spark cluster of 10 nodes with around 1 TB of total
>> > memory and 80 cores.
>> >
>> > With rank = 50, the memory requirement for ALS should be 20M x 50
>> > doubles on every worker, which is around 8 GB.
>> >
>> > Even if both factor matrices are cached in memory I should be bounded
>> > by ~9 GB, but even with 32 GB per worker I see GC errors.
>> >
>> > I am debugging the scalability and memory requirements of the algorithm
>> > further, but any insights will be very helpful.
>> >
>> > There are also two other issues:
>> >
>> > 1. If a GC error is hit, that worker JVM goes down and I have to
>> > restart it manually. Is this expected?
>> >
>> > 2. When I try to make use of all 80 cores on the cluster, I get
>> > java.io.FileNotFoundException errors on /tmp/. Is there some OS limit
>> > on how many cores can simultaneously access /tmp from a process?
>> >
>> > Thanks.
>> > Deb
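As a footnote on the sizing quoted above, here is the back-of-the-envelope arithmetic behind the ~8-9 GB figure. This reads the "20 x 1M" matrix as 20M users by 1M items, assumes 8 bytes per double, and ignores the JVM object and array overhead that Sean notes can be substantial:

    // Raw size of the rank-50 factor matrices; assumes 8-byte doubles and
    // no JVM overhead.
    val userFactorBytes = 20000000L * 50 * 8  // 8,000,000,000 bytes, ~8.0 GB
    val itemFactorBytes =  1000000L * 50 * 8  //   400,000,000 bytes, ~0.4 GB
    // Roughly 8.4 GB of raw factor data in total, hence the "~9 GB" bound.
    // The actual heap needed is larger once object overhead and shuffle
    // buffers are counted, and each worker holds only a partition of this.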