Sean,

Thanks.
It's a developer API and doesn't appear to be exposed.

Ewan

On 07/12/15 15:06, Sean Owen wrote:
I'm not sure if this is available in Python, but from 1.3 on you should
be able to call ALS.setFinalRDDStorageLevel with StorageLevel.NONE to
ask it to unpersist when it is done.
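For reference, a minimal Scala sketch of that call (the setter is a
Scala-side developer API; the other parameter values here are just
placeholders, and ratings is assumed to be an RDD[Rating] defined
elsewhere):

   import org.apache.spark.mllib.recommendation.ALS
   import org.apache.spark.storage.StorageLevel

   // Ask ALS not to keep the final user/product factor RDDs
   // persisted once training completes (@DeveloperApi, Spark 1.3+).
   val als = new ALS()
     .setRank(10)
     .setIterations(10)
     .setImplicitPrefs(true)
     .setFinalRDDStorageLevel(StorageLevel.NONE)
   val model = als.run(ratings)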

On Mon, Dec 7, 2015 at 1:42 PM, Ewan Higgs <ewan.hi...@ugent.be> wrote:
Jonathan,
Did you ever get to the bottom of this? I have some users working with Spark
in a classroom setting, and our example notebooks run into problems where so
much data is spilled to disk that the users run out of quota. A 1.5G input
set becomes >30G of spilled data on disk. I looked into how I could
unpersist the data so I could clean up the files, but I was unsuccessful.

We're using Spark 1.5.0.
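In the Scala shell there is at least a blunt workaround:
SparkContext.getPersistentRDDs lists everything still cached, so you can
unpersist it all between runs. A minimal sketch, assuming a live
SparkContext named sc:

   // Unpersist every RDD the context still has cached,
   // e.g. between two ALS training runs.
   sc.getPersistentRDDs.values.foreach(_.unpersist())

I couldn't find an equivalent exposed in PySpark, though.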

Yours,
Ewan

On 16/07/15 23:18, Stahlman, Jonathan wrote:

Hello all,

I am running the Spark recommendation algorithm in MLlib and have been
studying its output with various model configurations. Ideally I would like
to be able to run one job that trains the recommendation model with many
different configurations, to try to optimize for performance. Sample code
in Python is copied below.

The issue I have is that each new model that is trained caches a set of
RDDs, and eventually the executors run out of memory. Is there any way in
PySpark to unpersist() these RDDs after each iteration? The names of the
RDDs, as I gather them from the UI, are:

itemInBlocks
itemOutBlocks
Products
ratingBlocks
userInBlocks
userOutBlocks
users

I am using Spark 1.3.  Thank you for any help!

Regards,
Jonathan




   import itertools

   from pyspark.mllib.recommendation import ALS, Rating

   # Split the input ~99:1:1 into training, cross-validation, and test
   # sets (randomSplit normalizes the weights; 2 is the random seed).
   data_train, data_cv, data_test = data.randomSplit([99, 1, 1], 2)
   functions = [rating]  # defined elsewhere
   ranks = [10, 20]
   iterations = [10, 20]
   lambdas = [0.01, 0.1]
   alphas = [1.0, 50.0]

   results = []
   for ratingFunction, rank, numIterations, m_lambda, m_alpha in \
           itertools.product(functions, ranks, iterations, lambdas, alphas):
       # train model
       ratings_train = data_train.map(
           lambda l: Rating(l.user, l.product, ratingFunction(l)))
       model = ALS.trainImplicit(ratings_train, rank, numIterations,
                                 lambda_=float(m_lambda),
                                 alpha=float(m_alpha))

       # test performance on CV data
       ratings_cv = data_cv.map(
           lambda l: Rating(l.user, l.product, ratingFunction(l)))
       auc = areaUnderCurve(ratings_cv, model.predictAll)  # defined elsewhere

       # save results
       result = ",".join(str(v) for v in
                         [ratingFunction.__name__, rank, numIterations,
                          m_lambda, m_alpha, auc])
       results.append(result)




