In my experiment, if I do not call gc() explicitly, the shuffle files are not cleaned until the whole job finishes. I don't know why; maybe the RDDs cannot be GCed implicitly. In my situation, a full GC in the driver takes about 10 seconds, so I start a thread in the driver that runs GC every 120 seconds:
while (true) {
    System.gc();
    Thread.sleep(120 * 1000);
}

It works well now. Do you have a more elegant way to clean the shuffle files?

Best Regards,
Sendong Li

On Apr 1, 2015, at 5:09 AM, Xiangrui Meng <men...@gmail.com> wrote:

Hey Guoqiang and Sendong,

Could you comment on the overhead of calling gc() explicitly? The shuffle files should get cleaned a few seconds after checkpointing, but it is certainly possible to accumulate TBs of files in a few seconds. In that case, calling gc() may work the same as waiting a few seconds after each checkpoint. Is that correct?

Best,
Xiangrui

On Tue, Mar 31, 2015 at 8:58 AM, lisendong <lisend...@163.com> wrote:

GuoQiang's method works very well; it only takes 1TB of disk now. Thank you very much!

On Mar 31, 2015, at 4:47 PM, GuoQiang Li <wi...@qq.com> wrote:

You can try to enforce garbage collection:

import java.lang.ref.WeakReference

/** Run GC and make sure it actually has run. */
def runGC() {
    val weakRef = new WeakReference(new Object())
    val startTime = System.currentTimeMillis
    System.gc() // Make a best effort to run the garbage collection. It *usually* runs GC.
    // Wait until the weak reference object has been GCed.
    System.runFinalization()
    while (weakRef.get != null) {
        System.gc()
        System.runFinalization()
        Thread.sleep(200)
        if (System.currentTimeMillis - startTime > 10000) {
            throw new Exception("automatically cleanup error")
        }
    }
}

------------------ Original Message ------------------
From: "lisendong" <lisend...@163.com>
Date: Tuesday, Mar 31, 2015, 3:47 PM
To: "Xiangrui Meng" <men...@gmail.com>
Cc: "Xiangrui Meng" <m...@databricks.com>; "user" <user@spark.apache.org>; "Sean Owen" <so...@cloudera.com>; "GuoQiang Li" <wi...@qq.com>
Subject: Re: different result from implicit ALS with explicit ALS

I have updated my Spark source code to 1.3.1. The checkpoint works well, BUT the shuffle data still cannot be deleted automatically; the disk usage is still 30TB. I have set spark.cleaner.referenceTracking.blocking.shuffle to true. Do you know how to solve my problem?

Sendong Li

On Mar 31, 2015, at 12:11 AM, Xiangrui Meng <men...@gmail.com> wrote:

setCheckpointInterval was added in the current master and branch-1.3. Please help check whether it works. It will be included in the 1.3.1 and 1.4.0 releases.
-Xiangrui

On Mon, Mar 30, 2015 at 7:27 AM, lisendong <lisend...@163.com> wrote:

Hi, Xiangrui:

I found that the ALS of Spark 1.3.0 forgets to call checkpoint() in explicit ALS. The code is:
https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
[attachment: PastedGraphic-2.tiff]

The checkpoint is very important in my situation, because my task produces 1TB of shuffle data in each iteration. If the shuffle data is not deleted in each iteration (using checkpoint()), the task produces 30TB of data.

So I changed the ALS code and recompiled it myself, but it seems the checkpoint does not take effect, and the task still occupies 30TB of disk (I only added two lines to ALS.scala):
[attachment: PastedGraphic-3.tiff]

Also, the driver's log seems strange; why are the log lines printed together?
[attachment: PastedGraphic-1.tiff]

Thank you very much!

On Feb 26, 2015, at 11:33 PM, 163 <lisend...@163.com> wrote:

Thank you very much for your opinion :)

In our case, maybe it's dangerous to treat unobserved items as negative interactions (although we could give them a small confidence, I think they are still not credible...).

I will do more experiments and give you feedback :)

Thank you ;)

On Feb 26, 2015, at 11:16 PM, Sean Owen <so...@cloudera.com> wrote:

I believe that's right, and it is what I was getting at. Yes, the implicit formulation ends up implicitly including every possible interaction in its loss function, even unobserved ones. That could be the difference.

This is mostly an academic question, though.
In practice, you have click-like data and should be using the implicit version for sure.

However, you can give negative implicit feedback to the model. You could consider a no-click as a mild, observed, negative interaction; that is, supply a small negative value for these cases. Unobserved pairs are not part of the data set. I'd be careful about assuming that the lack of an action carries signal.

On Thu, Feb 26, 2015 at 3:07 PM, 163 <lisend...@163.com> wrote:

Oh my god, I think I understand now. In my case, there are three kinds of user-item pairs:

1. display-and-click pairs (positive pairs)
2. display-but-no-click pairs (negative pairs)
3. no-display pairs (unobserved pairs)

Explicit ALS only considers the first and second kinds, but implicit ALS considers all three kinds (and treats the third kind like the second, because their preference values are all zero and their confidences are all 1).

So the results are different, right?

Could you please give me some advice on which ALS I should use? If I use the implicit ALS, how do I distinguish the second and third kinds of pairs? :)

My opinion is that in my case I should use explicit ALS...

Thank you so much!

On Feb 26, 2015, at 10:41 PM, Xiangrui Meng <m...@databricks.com> wrote:

Lisen, did you use all m-by-n pairs during training? The implicit model penalizes unobserved ratings, while the explicit model doesn't.
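This difference can be made concrete with a toy computation (all factor values and ratings below are made up for illustration). With alpha = 0, every confidence c_ui is 1, so the implicit objective still sums squared error over every one of the m x n pairs, while the explicit objective sums only over the supplied pairs:

```java
// Toy illustration (made-up rank-1 factors and ratings): with alpha = 0,
// the implicit-ALS objective sums squared error over ALL m x n pairs,
// while the explicit objective sums only over the supplied pairs.
public class AlsLossToy {
    static double[] losses() {
        double[] x = {0.5, 1.0};               // 2 user factors (rank 1)
        double[] y = {1.0, 0.2};               // 2 item factors (rank 1)
        // Supplied pairs: (user, item) with ratings; pair (1,1) is unobserved.
        int[][] obs = {{0, 0}, {0, 1}, {1, 0}};
        double[] r = {1.0, 0.0, 1.0};
        double alpha = 0.0;

        // Explicit: sum squared error only over the supplied pairs.
        double explicitLoss = 0.0;
        for (int k = 0; k < obs.length; k++) {
            double pred = x[obs[k][0]] * y[obs[k][1]];
            explicitLoss += Math.pow(r[k] - pred, 2);
        }

        // Implicit: every (u, i) pair contributes, with preference
        // p = 1 if rating > 0 else 0, and confidence c = 1 + alpha * r.
        double implicitLoss = 0.0;
        for (int u = 0; u < 2; u++) {
            for (int i = 0; i < 2; i++) {
                double rating = 0.0;           // unobserved pairs act as r = 0
                for (int k = 0; k < obs.length; k++)
                    if (obs[k][0] == u && obs[k][1] == i) rating = r[k];
                double p = rating > 0 ? 1.0 : 0.0;
                double c = 1.0 + alpha * rating;
                implicitLoss += c * Math.pow(p - x[u] * y[i], 2);
            }
        }
        return new double[]{explicitLoss, implicitLoss};
    }

    public static void main(String[] args) {
        double[] l = losses();
        System.out.println("explicit loss = " + l[0]);
        System.out.println("implicit loss = " + l[1]);
    }
}
```

In this toy example the implicit loss picks up an extra 0.04 from the single unobserved pair, which is penalized toward preference 0; that term never appears in the explicit objective.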
-Xiangrui

On Feb 26, 2015, at 6:26 AM, Sean Owen <so...@cloudera.com> wrote:

+user

On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen <so...@cloudera.com> wrote:

I think I may have it backwards, and that you are correct to keep the 0 elements in train() in order to try to reproduce the same result.

The second formulation is called 'weighted regularization' and is used for both implicit and explicit feedback, as far as I can see in the code.

Hmm, I'm actually not clear why these would produce different results. Different code paths are used, to be sure, but I'm not yet sure why they would give different results.

In general you wouldn't use train() for data like this, though, and would never set alpha = 0.

On Thu, Feb 26, 2015 at 2:15 PM, lisendong <lisend...@163.com> wrote:

I want to confirm the loss function you use (sorry, I'm not so familiar with Scala code, so I did not understand the MLlib source code).

According to the papers:

In your implicit-feedback ALS, the loss function is (ICDM 2008):

    min sum_{u,i} c_ui * (p_ui - x_u . y_i)^2 + lambda * (sum_u ||x_u||^2 + sum_i ||y_i||^2)

In the explicit-feedback ALS, the loss function is (Netflix 2008, weighted-lambda regularization, where n_u and m_i count the observed ratings for user u and item i):

    min sum_{(u,i) observed} (r_ui - x_u . y_i)^2 + lambda * (sum_u n_u ||x_u||^2 + sum_i m_i ||y_i||^2)

Note that besides the difference in the confidence parameter c_ui, the regularization is also different. Does your code also have this difference?
Best Regards,
Sendong Li

On Feb 26, 2015, at 9:42 PM, lisendong <lisend...@163.com> wrote:

Hi Meng, Fotero, Sowen:

I'm using ALS with Spark 1.0.0; the code should be:
https://github.com/apache/spark/blob/branch-1.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala

I think the following two methods should produce the same (or nearly the same) result:

MatrixFactorizationModel model = ALS.train(ratings.rdd(), 30, 30, 0.01, -1, 1);

MatrixFactorizationModel model = ALS.trainImplicit(ratings.rdd(), 30, 30, 0.01, -1, 0, 1);

The data I used is a display log, in the following format:

user item if-click

I use 1.0 as the score for click pairs and 0 as the score for non-click pairs.

In the second method, alpha is set to zero, so the confidences for positive and negative pairs are both 1.0 (right?).

I think the two methods should produce similar results, but in fact the second method's result is very bad (the AUC of the first result is 0.7, while the AUC of the second is only 0.61).

I cannot understand why; could you help me?

Thank you very much!
Best Regards,
Sendong Li
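For readers who want to try GuoQiang's trick outside of Spark, here is a plain-Java sketch of the same weak-reference loop (the 200 ms sleep and 10-second timeout are kept from the Scala original; the class name is my own). It blocks until the JVM has demonstrably run a collection, by waiting for a throwaway weak reference to be cleared:

```java
import java.lang.ref.WeakReference;

// Plain-Java version of the runGC() trick from the thread: request a GC and
// block until a throwaway weak reference has actually been cleared, which
// proves that a collection ran.
public class RunGC {
    public static void runGC() {
        WeakReference<Object> weakRef = new WeakReference<>(new Object());
        long startTime = System.currentTimeMillis();
        System.gc();               // best effort; *usually* triggers a collection
        System.runFinalization();
        while (weakRef.get() != null) {
            System.gc();
            System.runFinalization();
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
            if (System.currentTimeMillis() - startTime > 10000) {
                throw new RuntimeException("GC did not observably run within 10 seconds");
            }
        }
    }

    public static void main(String[] args) {
        runGC();
        System.out.println("GC has run");
    }
}
```

A wrapper like the driver-side thread earlier in the thread could call this every couple of minutes instead of a bare System.gc(), so each cycle is guaranteed to have actually collected (and thus allowed Spark's reference-tracking cleaner to drop shuffle files).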