Usage within AWS is a neighborly thing to do. But yes, Amazon donates this bandwidth.

On Thu, Oct 13, 2011 at 8:11 PM, Lance Norskog <[email protected]> wrote:
> Is the Apache public download bandwidth donated by Amazon? Or should we try
> to keep usage within AWS?
>
> On Thu, Oct 13, 2011 at 3:47 AM, Grant Ingersoll <[email protected]> wrote:
> >
> > On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
> >
> > > Grant,
> > >
> > > Can you share a little more details about the results, do you get any
> > > exceptions? Or do you just get no results?
> >
> > No results.
> >
> > >
> > > Using the NaNs inside the similarity matrix vectors has been included in
> > > the job for a very long time and should not cause any problems. As Sean
> > > already mentioned we have unit tests with toy data that should catch the
> > > very obvious errors in this code.
> >
> > Yeah, I don't know what happened. I know I was getting results as little
> > as two weeks ago. I will try rolling back to an earlier commit.
> >
> > >
> > > Can you share the dataset? I can do a testrun on my research cluster.
> >
> > I already have earlier in this thread. There is a small set via the link
> > below or you can use the ASF email public dataset on Amazon or any subset
> > of it.
> >
> > >
> > > --sebastian
> > >
> > > On 13.10.2011 08:37, Sean Owen wrote:
> > >> RecommenderJob? The unit tests run it all the time.
> > >> There should not be any glitches with static variables -- don't think
> > >> there are any.
> > >>
> > >> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <[email protected]> wrote:
> > >>> Is this job working well for anyone now?
> > >>> When was the last time this job worked for someone?
> > >>>
> > >>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <[email protected]> wrote:
> > >>>
> > >>>> Both local and on EC2
> > >>>>
> > >>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
> > >>>>
> > >>>>> Hi Grant,
> > >>>>>
> > >>>>> Just curious, are you running this locally or distributed?
> > >>>>>
> > >>>>> I'd run into a similar issue, though in a completely different
> > >>>>> algorithm (Jimmy Lin's PageRank implementation) due to the use of a
> > >>>>> static variable.
> > >>>>>
> > >>>>> When running locally, this wasn't getting cleared between loops, and
> > >>>>> thus I got wonky results.
> > >>>>>
> > >>>>> The same thing would have happened with JVM reuse enabled.
> > >>>>>
> > >>>>> -- Ken
> > >>>>>
> > >>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
> > >>>>>
> > >>>>>> Digging some more:
> > >>>>>>
> > >>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
> > >>>>>> simColumn of:
> > >>>>>>
> > >>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
> > >>>>>>
> > >>>>>> Which then becomes the numerator and the denom.
> > >>>>>>
> > >>>>>> Looping, my next simCol is:
> > >>>>>>
> > >>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
> > >>>>>>
> > >>>>>> and then
> > >>>>>>
> > >>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
> > >>>>>>
> > >>>>>> ...
> > >>>>>>
> > >>>>>> Each time, those are getting added into the numerators/denoms value,
> > >>>>>> such that by the time we are done looping (line 161), we have:
> > >>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> > >>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> > >>>>>>
> > >>>>>> numberOfSimilarItemsUsed:
> > >>>>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
> > >>>>>>
> > >>>>>> Not sure on how to interpret this as I haven't dug into the math here
> > >>>>>> yet or figured out where those NaN are coming from originally.
> > >>>>>>
> > >>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
> > >>>>>>
> > >>>>>>>
> > >>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
> > >>>>>>>>
> > >>>>>>>>> Where is the NaN coming up -- what has this value?
> > >>>>>>>>
> > >>>>>>>> simColumn seems to be the originator in the Aggregate step. For
> > >>>>>>>> instance, my current breakpoint shows:
> > >>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
> > >>>>>>>>
> > >>>>>>>> I can also see some in the PartialMultiplyMapper via the
> > >>>>>>>> similarityMatrixColumn.
> > >>>>>>>>
> > >>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
> > >>>>>>>> <code>
> > >>>>>>>> /* remove self similarity */
> > >>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
> > >>>>>>>> </code>
> > >>>>>>>
> > >>>>>>> Ah, but that is just taking care of itself, so maybe not the issue.
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>>> It should be propagated in some cases but not others. I'm not aware
> > >>>>>>>>> of any changes here.
> > >>>>>>>>
> > >>>>>>>> yeah, me neither. This is all related to MAHOUT-798.
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Generally small data sets will have this problem of not being able to
> > >>>>>>>>> compute much of anything useful, so NaN might be right here.
> > >>>>>>>>> But you say it was different recently, which seems to rule that out.
> > >>>>>>>>
> > >>>>>>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
> > >>>>>>>> it's just that's a whole lot harder to debug.
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <[email protected]> wrote:
> > >>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
> > >>>>>>>>>> getting any recommendations due to NaNs being calculated in the
> > >>>>>>>>>> AggregateAndRecommend step.
> > >>>>>>>>>> I'm not quite sure what is going on as it seems like this was
> > >>>>>>>>>> working as little as two weeks ago (post Sebastian's big change to
> > >>>>>>>>>> RecJob), but I don't see a whole lot of changes in that part of the
> > >>>>>>>>>> code.
> > >>>>>>>>>>
> > >>>>>>>>>> The data is user id's mapping to email thread ids. My input data is
> > >>>>>>>>>> simply a triple of user id, thread id, 1 (meaning that user
> > >>>>>>>>>> participated in that thread) It seems like I will have a lot of good
> > >>>>>>>>>> values in the inputs to the AggregateAndRecommend step, except one id
> > >>>>>>>>>> will be NaN and this then seems to get added in and makes everything
> > >>>>>>>>>> NaN (I realize this is a very naive understanding). I sense that I
> > >>>>>>>>>> should be looking upstream in the process for a fix, but I am not
> > >>>>>>>>>> sure where that is.
> > >>>>>>>>>>
> > >>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs? If you
> > >>>>>>>>>> want to try this with a small data set, you can get it here:
> > >>>>>>>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout
> > >>>>>>>>>> (but note the companion article is not published yet.)
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks,
> > >>>>>>>>>> Grant
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> --------------------------------------------
> > >>>>>>> Grant Ingersoll
> > >>>>>>> http://www.lucidimagination.com
> > >>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> > >>>>>>>
> > >>>>>>
> > >>>>>> --------------------------------------------
> > >>>>>> Grant Ingersoll
> > >>>>>> http://www.lucidimagination.com
> > >>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> > >>>>>>
> > >>>>>
> > >>>>> --------------------------
> > >>>>> Ken Krugler
> > >>>>> +1 530-210-6378
> > >>>>> http://bixolabs.com
> > >>>>> custom big data solutions & training
> > >>>>> Hadoop, Cascading, Mahout & Solr
> > >>>>>
> > >>>>
> > >>>> --------------------------------------------
> > >>>> Grant Ingersoll
> > >>>> http://www.lucidimagination.com
> > >>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> > >>>>
> > >>>
> > >>> --
> > >>> Lance Norskog
> > >>> [email protected]
> > >>>
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com
> > Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >
>
> --
> Lance Norskog
> [email protected]
>
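
On the numerators/denoms observation quoted above: the all-NaN result is consistent with plain IEEE 754 arithmetic, where adding NaN to any value yields NaN. If every simColumn carries a NaN in a different slot, the element-wise sums end up NaN in every slot once all columns have been added. Below is a minimal sketch of just that arithmetic. It is not the Mahout code; the class name NaNPropagationSketch, the method sumColumns, and the three-item columns are hypothetical, and plain double arrays stand in for Mahout vectors.

```java
import java.util.Arrays;

// Hypothetical illustration only; not taken from AggregateAndRecommend.
public class NaNPropagationSketch {

    /**
     * Element-wise sum of equally sized columns. Once a slot receives
     * a NaN it stays NaN, because NaN + x is NaN under IEEE 754.
     */
    static double[] sumColumns(double[][] columns) {
        double[] sums = new double[columns[0].length];
        for (double[] column : columns) {
            for (int i = 0; i < column.length; i++) {
                sums[i] += column[i];
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        double s = 0.9566912651062012; // similarity value seen in the thread
        // Assumed toy input: three columns, each with NaN in a different slot.
        double[][] columns = {
            {s, s, Double.NaN},
            {s, Double.NaN, s},
            {Double.NaN, s, s},
        };
        // prints [NaN, NaN, NaN]
        System.out.println(Arrays.toString(sumColumns(columns)));
    }
}
```

Whether that outcome is intended for this data set (for example, because the user already has a preference for every candidate item) or points to a regression is a separate question that this sketch does not answer.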
