I have updated my Spark source code to 1.3.1, and the checkpoint works well.
BUT the shuffle data still cannot be deleted automatically; the disk usage is still 30 TB. I have set spark.cleaner.referenceTracking.blocking.shuffle to true. Do you know how to solve my problem?

Sendong Li

> On Mar 31, 2015, at 12:11 AM, Xiangrui Meng <men...@gmail.com> wrote:
>
> setCheckpointInterval was added in the current master and branch-1.3. Please
> help check whether it works. It will be included in the 1.3.1 and 1.4.0
> releases. -Xiangrui
>
> On Mon, Mar 30, 2015 at 7:27 AM, lisendong <lisend...@163.com> wrote:
>
> Hi Xiangrui,
>
> I found that the ALS in Spark 1.3.0 forgets to call checkpoint() in explicit ALS.
> The code is:
> https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
>
> [image: PastedGraphic-2.tiff]
>
> The checkpoint is very important in my situation, because my job produces
> 1 TB of shuffle data in each iteration; if the shuffle data is not deleted in
> each iteration (using checkpoint()), the job produces 30 TB of data.
>
> So I changed the ALS code and recompiled it myself (I only added two lines to
> ALS.scala), but the checkpoint does not seem to take effect, and the job
> still occupies 30 TB of disk:
>
> [image: PastedGraphic-3.tiff]
>
> Also, the driver's log seems strange. Why are the log lines printed together?
>
> [image: PastedGraphic-1.tiff]
>
> Thank you very much!
>
>> On Feb 26, 2015, at 11:33 PM, 163 <lisend...@163.com> wrote:
>>
>> Thank you very much for your opinion :)
>>
>> In our case, it may be dangerous to treat an unobserved item as a negative
>> interaction (although we could give them a small confidence, I think they
>> are still not credible...)
>>
>> I will do more experiments and give you feedback :)
>>
>> Thank you :)
>>
>>> On Feb 26, 2015, at 11:16 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>> I believe that's right, and that is what I was getting at. Yes, the
>>> implicit formulation ends up implicitly including every possible
>>> interaction in its loss function, even unobserved ones. That could be
>>> the difference.
>>>
>>> This is mostly an academic question, though. In practice, you have
>>> click-like data and should be using the implicit version for sure.
>>>
>>> However, you can give negative implicit feedback to the model. You
>>> could consider no-click as a mild, observed, negative interaction.
>>> That is: supply a small negative value for these cases. Unobserved
>>> pairs are not part of the data set. I'd be careful about assuming the
>>> lack of an action carries signal.
>>>
>>>> On Thu, Feb 26, 2015 at 3:07 PM, 163 <lisend...@163.com> wrote:
>>>>
>>>> Oh, I think I understand now...
>>>> In my case, there are three kinds of user-item pairs:
>>>>
>>>> 1. displayed and clicked (positive pair)
>>>> 2. displayed but not clicked (negative pair)
>>>> 3. not displayed (unobserved pair)
>>>>
>>>> Explicit ALS only considers the first and second kinds,
>>>> but implicit ALS considers all three kinds of pairs (and treats the
>>>> third kind like the second, because their preference values are all
>>>> zero and their confidences are all 1).
>>>>
>>>> So the results are different. Right?
>>>>
>>>> Could you please give me some advice on which ALS I should use?
>>>> If I use the implicit ALS, how do I distinguish the second and the
>>>> third kinds of pairs? :)
>>>>
>>>> My opinion is that in my case I should use explicit ALS...
>>>>
>>>> Thank you so much.
>>>>
>>>> On Feb 26, 2015, at 10:41 PM, Xiangrui Meng <m...@databricks.com> wrote:
>>>>
>>>> Lisen, did you use all m-by-n pairs during training?
>>>> The implicit model penalizes unobserved ratings, while the explicit
>>>> model doesn't. -Xiangrui
>>>>
>>>>> On Feb 26, 2015 6:26 AM, "Sean Owen" <so...@cloudera.com> wrote:
>>>>>
>>>>> +user
>>>>>
>>>>>> On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>
>>>>>> I think I may have it backwards, and that you are correct to keep the 0
>>>>>> elements in train() in order to try to reproduce the same result.
>>>>>>
>>>>>> The second formulation is called 'weighted regularization' and is used
>>>>>> for both implicit and explicit feedback, as far as I can see in the code.
>>>>>>
>>>>>> Hmm, I'm actually not clear on why these would produce different
>>>>>> results. Different code paths are used, to be sure, but I'm not yet
>>>>>> sure why they would give different results.
>>>>>>
>>>>>> In general you wouldn't use train() for data like this, though, and
>>>>>> you would never set alpha=0.
>>>>>>
>>>>>>> On Thu, Feb 26, 2015 at 2:15 PM, lisendong <lisend...@163.com> wrote:
>>>>>>>
>>>>>>> I want to confirm which loss function you use (sorry, I'm not so
>>>>>>> familiar with Scala code, so I did not understand the MLlib source).
>>>>>>>
>>>>>>> According to the papers: in your implicit-feedback ALS, the loss
>>>>>>> function is the one from the ICDM 2008 paper, and in the
>>>>>>> explicit-feedback ALS, the loss function is the one from the Netflix
>>>>>>> 2008 paper.
>>>>>>>
>>>>>>> Note that besides the difference in the confidence parameter c_ui,
>>>>>>> the regularization is also different. Does your code also have this
>>>>>>> difference?
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Sendong Li
>>>>>>>
>>>>>>>> On Feb 26, 2015, at 9:42 PM, lisendong <lisend...@163.com> wrote:
>>>>>>>>
>>>>>>>> Hi meng, fotero, sowen:
>>>>>>>>
>>>>>>>> I'm using ALS with Spark 1.0.0; the code should be:
>>>>>>>>
>>>>>>>> https://github.com/apache/spark/blob/branch-1.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
>>>>>>>>
>>>>>>>> I think the following two methods should produce the same (or
>>>>>>>> nearly the same) result:
>>>>>>>>
>>>>>>>> MatrixFactorizationModel model = ALS.train(ratings.rdd(), 30, 30, 0.01, -1, 1);
>>>>>>>>
>>>>>>>> MatrixFactorizationModel model = ALS.trainImplicit(ratings.rdd(), 30, 30, 0.01, -1, 0, 1);
>>>>>>>>
>>>>>>>> The data I used is a display log, in the following format:
>>>>>>>>
>>>>>>>> user item if-click
>>>>>>>>
>>>>>>>> I use 1.0 as the score for a click pair, and 0 as the score for a
>>>>>>>> non-click pair.
>>>>>>>>
>>>>>>>> In the second method, alpha is set to zero, so the confidences for
>>>>>>>> positive and negative pairs are both 1.0 (right?).
>>>>>>>>
>>>>>>>> I think the two methods should produce similar results, but the
>>>>>>>> second method's result is very bad (the AUC of the first result is
>>>>>>>> 0.7, but the AUC of the second result is only 0.61).
>>>>>>>>
>>>>>>>> I could not understand why. Could you help me?
>>>>>>>>
>>>>>>>> Thank you very much!
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Sendong Li
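For reference, the two loss functions from the papers cited above (the ICDM 2008 implicit-feedback paper by Hu, Koren and Volinsky, and the 2008 Netflix Prize paper by Zhou et al.) can be written as follows; this is a sketch from the papers, and the symbols may differ from those in MLlib's code:

```latex
% Implicit feedback (Hu, Koren & Volinsky, ICDM 2008), with
% confidence c_{ui} = 1 + \alpha r_{ui} and preference p_{ui} = [r_{ui} > 0];
% the sum runs over ALL m-by-n user-item pairs:
\min_{x_*,\, y_*} \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^\top y_i\bigr)^2
  + \lambda \Bigl( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Bigr)

% Explicit feedback with weighted-\lambda regularization (Zhou et al., 2008),
% where K is the set of observed ratings and n_u, n_i are the numbers of
% ratings by user u and of item i:
\min_{x_*,\, y_*} \sum_{(u,i) \in K} \bigl(r_{ui} - x_u^\top y_i\bigr)^2
  + \lambda \Bigl( \sum_u n_u \lVert x_u \rVert^2 + \sum_i n_i \lVert y_i \rVert^2 \Bigr)
```

The implicit loss sums over every possible pair while the explicit loss sums only over the observed set K, which is exactly the "penalizes unobserved ratings" distinction Xiangrui makes above; the regularization terms also differ, as noted in the question.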
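The alpha = 0 point discussed above can be made concrete with a small standalone sketch (plain Scala, not Spark or MLlib code; the object and method names are illustrative) of the ICDM 2008 confidence weighting c_ui = 1 + alpha * r_ui:

```scala
// Illustrative sketch of the implicit-ALS confidence weighting; not MLlib code.
object ConfidenceDemo {
  // c_ui = 1 + alpha * r_ui (ICDM 2008)
  def confidence(r: Double, alpha: Double): Double = 1.0 + alpha * r

  // p_ui = 1 if a positive interaction was observed, else 0
  def preference(r: Double): Double = if (r > 0) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    val clicked = 1.0       // displayed and clicked
    val shownNoClick = 0.0  // displayed but not clicked

    // With alpha = 0, both kinds of observed pair get confidence 1.0,
    // the same weight the model gives every unobserved pair:
    assert(confidence(clicked, 0.0) == 1.0)
    assert(confidence(shownNoClick, 0.0) == 1.0)

    // With a positive alpha (the paper suggests values around 40), clicked
    // pairs are weighted far more heavily than no-clicks and unobserved pairs:
    assert(confidence(clicked, 40.0) == 41.0)
    assert(confidence(shownNoClick, 40.0) == 1.0)

    println("ok")
  }
}
```

So with alpha = 0, the implicit formulation cannot distinguish a displayed-but-not-clicked pair from a never-displayed pair by confidence, whereas the explicit formulation fits the literal 0 ratings on observed pairs only; this may be one reason the two calls in the thread behave so differently.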
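And for the checkpointing problem at the top of the thread, a minimal driver-side configuration sketch, assuming the Spark 1.3.1 APIs (setCheckpointInterval on ml.recommendation.ALS and the cleaner setting mentioned above; the app name and HDFS path are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.recommendation.ALS

// Sketch, assuming Spark 1.3.1. Without a checkpoint directory set on the
// SparkContext, checkpointing cannot take effect, which would leave the
// shuffle lineage (and its files on disk) growing across iterations.
val conf = new SparkConf()
  .setAppName("als-checkpoint-sketch") // hypothetical
  .set("spark.cleaner.referenceTracking.blocking.shuffle", "true")
val sc = new SparkContext(conf)
sc.setCheckpointDir("hdfs:///tmp/als-checkpoint") // hypothetical path

val als = new ALS()
  .setRank(30)
  .setMaxIter(30)
  .setRegParam(0.01)
  .setCheckpointInterval(3) // checkpoint every 3 iterations to truncate lineage
```

Verifying that the checkpoint directory is actually set may be worth doing before recompiling ALS, since without it any added checkpoint() calls cannot free the old shuffle data.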