GuoQiang's method works very well; it only takes 1TB of disk now.
Thank you very much!

> On Mar 31, 2015, at 4:47 PM, GuoQiang Li <wi...@qq.com> wrote:
>
> You can try to enforce garbage collection:
>
> /** Run GC and make sure it actually has run */
> def runGC() {
>   val weakRef = new WeakReference(new Object())
>   val startTime = System.currentTimeMillis
>   System.gc() // Make a best effort to run the garbage collection. It *usually* runs GC.
>   // Wait until a weak reference object has been GCed
>   System.runFinalization()
>   while (weakRef.get != null) {
>     System.gc()
>     System.runFinalization()
>     Thread.sleep(200)
>     if (System.currentTimeMillis - startTime > 10000) {
>       throw new Exception("automatically cleanup error")
>     }
>   }
> }
>
>
> ------------------ Original Message ------------------
> From: "lisendong" <lisend...@163.com>
> Date: Mar 31, 2015 (Tuesday), 3:47 PM
> To: "Xiangrui Meng" <men...@gmail.com>
> Cc: "Xiangrui Meng" <m...@databricks.com>; "user" <user@spark.apache.org>; "Sean Owen" <so...@cloudera.com>; "GuoQiang Li" <wi...@qq.com>
> Subject: Re: different result from implicit ALS with explicit ALS
>
> I have updated my Spark source code to 1.3.1.
>
> The checkpoint works well.
>
> BUT the shuffle data still cannot be deleted automatically; the disk usage is still 30TB.
>
> I have set spark.cleaner.referenceTracking.blocking.shuffle to true.
>
> Do you know how to solve my problem?
>
> Sendong Li
>
>
>> On Mar 31, 2015, at 12:11, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> setCheckpointInterval was added in the current master and branch-1.3. Please help check whether it works. It will be included in the 1.3.1 and 1.4.0 releases.
>> -Xiangrui
>>
>> On Mon, Mar 30, 2015 at 7:27 AM, lisendong <lisend...@163.com> wrote:
>> Hi, Xiangrui:
>> I found that the ALS of Spark 1.3.0 forgets to do checkpoint() in explicit ALS. The code is:
>> https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
>> [attachment: PastedGraphic-2.tiff]
>>
>> The checkpoint is very important in my situation, because my task produces 1TB of shuffle data in each iteration. If the shuffle data is not deleted in each iteration (using checkpoint()), the task produces 30TB of data.
>>
>> So I changed the ALS code and re-compiled it myself, but it seems the checkpoint does not take effect, and the task still occupies 30TB of disk. (I only added two lines to ALS.scala):
>>
>> [attachment: PastedGraphic-3.tiff]
>>
>> Also, the driver's log seems strange; why are the log lines printed together...
>> [attachment: PastedGraphic-1.tiff]
>>
>> Thank you very much!
>>
>>> On Feb 26, 2015, at 11:33 PM, 163 <lisend...@163.com> wrote:
>>>
>>> Thank you very much for your opinion :)
>>>
>>> In our case, it may be dangerous to treat un-observed items as negative interactions (although we could give them small confidence, I think they are still not credible...).
>>>
>>> I will do more experiments and give you feedback :)
>>>
>>> Thank you ;)
>>>
>>>
>>>> On Feb 26, 2015, at 23:16, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>> I believe that's right, and is what I was getting at. Yes, the implicit formulation ends up implicitly including every possible interaction in its loss function, even unobserved ones. That could be the difference.
>>>>
>>>> This is mostly an academic question though.
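For reference, a minimal sketch of what using the `setCheckpointInterval` feature discussed above might look like on the `ml` ALS (assuming Spark 1.3.1+; the checkpoint directory path and the `ratingsDF` DataFrame are illustrative placeholders, not from the thread):

```scala
// Sketch: periodic checkpointing truncates the RDD lineage so that the
// per-iteration shuffle files can be cleaned up instead of accumulating.
// Placeholder inputs: sc (SparkContext), ratingsDF (user/item/rating DataFrame).
import org.apache.spark.SparkContext
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.DataFrame

def trainWithCheckpointing(sc: SparkContext, ratingsDF: DataFrame) = {
  sc.setCheckpointDir("hdfs:///tmp/als-checkpoints") // placeholder path

  val als = new ALS()
    .setRank(30)
    .setMaxIter(30)
    .setRegParam(0.01)
    .setCheckpointInterval(5) // checkpoint every 5 iterations

  als.fit(ratingsDF)
}
```

Without a checkpoint directory set on the SparkContext, the interval setting has no effect, which may explain a recompiled build appearing not to work.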
>>>> In practice, you have click-like data and should be using the implicit version for sure.
>>>>
>>>> However, you can give negative implicit feedback to the model. You could consider no-click as a mild, observed, negative interaction. That is: supply a small negative value for these cases. Unobserved pairs are not part of the data set. I'd be careful about assuming the lack of an action carries signal.
>>>>
>>>>> On Thu, Feb 26, 2015 at 3:07 PM, 163 <lisend...@163.com> wrote:
>>>>> Oh my god, I think I understood...
>>>>> In my case, there are three kinds of user-item pairs:
>>>>>
>>>>> 1. Display-and-click pairs (positive pairs)
>>>>> 2. Display-but-no-click pairs (negative pairs)
>>>>> 3. No-display pairs (unobserved pairs)
>>>>>
>>>>> Explicit ALS only considers the first and second kinds.
>>>>> But implicit ALS considers all three kinds of pairs (and treats the third kind the same as the second, because their preference values are all zero and their confidences are all 1).
>>>>>
>>>>> So the results are different, right?
>>>>>
>>>>> Could you please give me some advice on which ALS I should use?
>>>>> If I use the implicit ALS, how do I distinguish the second and the third kinds of pairs? :)
>>>>>
>>>>> My opinion is that in my case I should use explicit ALS...
>>>>>
>>>>> Thank you so much
>>>>>
>>>>> On Feb 26, 2015, at 22:41, Xiangrui Meng <m...@databricks.com> wrote:
>>>>>
>>>>> Lisen, did you use all m-by-n pairs during training? The implicit model penalizes unobserved ratings, while the explicit model doesn't.
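The point that the implicit model penalizes unobserved ratings while the explicit one doesn't can be seen in a toy sketch (this is an illustration of the two objectives from the papers, not MLlib's actual implementation; the `predict` stand-in and all values are placeholders):

```scala
// Toy sketch: which (user, item) pairs contribute to each ALS objective.
// Unobserved pairs are simply absent from the `observed` map.
val observed = Map((0, 0) -> 1.0, (0, 1) -> 0.0, (1, 1) -> 1.0) // (user, item) -> score
val numUsers = 2
val numItems = 2
val alpha = 40.0

def predict(u: Int, i: Int): Double = 0.5 // stand-in for x_u . y_i

// Explicit ALS: squared error only over the observed pairs.
val explicitLoss = observed.map { case ((u, i), r) =>
  val e = r - predict(u, i); e * e
}.sum

// Implicit ALS (ICDM 2008): EVERY pair contributes, weighted by the
// confidence c_ui = 1 + alpha * r_ui, against preference p_ui = (r_ui > 0).
val implicitLoss = (for (u <- 0 until numUsers; i <- 0 until numItems) yield {
  val r = observed.getOrElse((u, i), 0.0)
  val c = 1.0 + alpha * r
  val p = if (r > 0) 1.0 else 0.0
  val e = p - predict(u, i)
  c * e * e
}).sum
```

Note how the implicit sum ranges over all m-by-n pairs, so the unobserved (1, 0) pair is pulled toward preference 0 with confidence 1, exactly the behavior being discussed.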
>>>>> -Xiangrui
>>>>>
>>>>>> On Feb 26, 2015 6:26 AM, "Sean Owen" <so...@cloudera.com> wrote:
>>>>>>
>>>>>> +user
>>>>>>
>>>>>>> On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>
>>>>>>> I think I may have it backwards, and that you are correct to keep the 0 elements in train() in order to try to reproduce the same result.
>>>>>>>
>>>>>>> The second formulation is called 'weighted regularization' and is used for both implicit and explicit feedback, as far as I can see in the code.
>>>>>>>
>>>>>>> Hm, I'm actually not clear why these would produce different results. Different code paths are used to be sure, but I'm not yet sure why they would give different results.
>>>>>>>
>>>>>>> In general you wouldn't use train() for data like this though, and would never set alpha=0.
>>>>>>>
>>>>>>>> On Thu, Feb 26, 2015 at 2:15 PM, lisendong <lisend...@163.com> wrote:
>>>>>>>>
>>>>>>>> I want to confirm which loss function you use (sorry, I'm not so familiar with Scala code, so I did not understand the MLlib source code).
>>>>>>>>
>>>>>>>> According to the papers:
>>>>>>>>
>>>>>>>> In your implicit-feedback ALS, the loss function is (ICDM 2008):
>>>>>>>>
>>>>>>>> In the explicit-feedback ALS, the loss function is (Netflix 2008):
>>>>>>>>
>>>>>>>> Note that besides the difference of the confidence parameter Cui, the regularization is also different. Does your code also have this difference?
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Sendong Li
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Feb 26, 2015, at 9:42 PM, lisendong <lisend...@163.com> wrote:
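The 'weighted regularization' Sean refers to (weighted-lambda regularization from the Netflix 2008 paper) can be sketched as follows; this is an illustration of the formula, not MLlib code, and the factor maps and rating counts are hypothetical inputs:

```scala
// Sketch of weighted-lambda regularization: each user's/item's L2 penalty
// is scaled by its number of observed ratings, so heavy users/items are
// regularized more strongly. All inputs here are illustrative placeholders.
def norm2(v: Array[Double]): Double = v.map(x => x * x).sum

def weightedLambdaReg(lambda: Double,
                      userFactors: Map[Int, Array[Double]],
                      itemFactors: Map[Int, Array[Double]],
                      nRatingsUser: Map[Int, Int],
                      nRatingsItem: Map[Int, Int]): Double = {
  val userPart = userFactors.map { case (u, x) => nRatingsUser(u) * norm2(x) }.sum
  val itemPart = itemFactors.map { case (i, y) => nRatingsItem(i) * norm2(y) }.sum
  lambda * (userPart + itemPart)
}
```

Plain (unweighted) regularization would drop the `nRatings*` factors, which is the difference between the two papers' penalty terms that the question is about.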
>>>>>>>>>
>>>>>>>>> Hi Meng, Fotero, Sowen:
>>>>>>>>>
>>>>>>>>> I'm using ALS with Spark 1.0.0; the code should be:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/blob/branch-1.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
>>>>>>>>>
>>>>>>>>> I think the following two methods should produce the same (or nearly the same) result:
>>>>>>>>>
>>>>>>>>> MatrixFactorizationModel model = ALS.train(ratings.rdd(), 30, 30, 0.01, -1, 1);
>>>>>>>>>
>>>>>>>>> MatrixFactorizationModel model = ALS.trainImplicit(ratings.rdd(), 30, 30, 0.01, -1, 0, 1);
>>>>>>>>>
>>>>>>>>> The data I used is a display log; the format of the log is as follows:
>>>>>>>>>
>>>>>>>>> user item if-click
>>>>>>>>>
>>>>>>>>> I use 1.0 as the score for click pairs, and 0 as the score for non-click pairs.
>>>>>>>>>
>>>>>>>>> In the second method, alpha is set to zero, so the confidences for positive and negative pairs are both 1.0 (right?).
>>>>>>>>>
>>>>>>>>> I think the two methods should produce similar results, but the second method's result is very bad (the AUC of the first result is 0.7, but the AUC of the second result is only 0.61).
>>>>>>>>>
>>>>>>>>> I could not understand why; could you help me?
>>>>>>>>>
>>>>>>>>> Thank you very much!
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Sendong Li
>>
>> [3 attachments: PastedGraphic-2.tiff (48K), PastedGraphic-1.tiff (139K), PastedGraphic-3.tiff (81K)]
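Following Sean's earlier suggestion (no-click as a mild observed negative, no-display left out entirely), one way the display log described above might be encoded for trainImplicit is sketched below. The `displayLog` RDD, the -0.1 value, and alpha = 40.0 are illustrative assumptions, not recommendations from the thread:

```scala
// Sketch: encode display-but-no-click as a mild negative implicit
// interaction; no-display pairs are simply absent from the training RDD.
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// displayLog: (user, item, clicked) triples built elsewhere; a placeholder.
def trainFromDisplayLog(displayLog: RDD[(Int, Int, Boolean)]) = {
  val ratings = displayLog.map { case (user, item, clicked) =>
    if (clicked) Rating(user, item, 1.0)   // observed positive
    else         Rating(user, item, -0.1)  // observed mild negative (illustrative value)
  }
  ALS.trainImplicit(ratings, 30, 30, 0.01, 40.0) // rank, iterations, lambda, alpha
}
```

With a nonzero alpha, clicked pairs get higher confidence than no-click pairs, which distinguishes the second and third kinds of pairs asked about earlier in the thread.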