Keeping usage within AWS is the neighborly thing to do.

But yes, Amazon donates this bandwidth.

On Thu, Oct 13, 2011 at 8:11 PM, Lance Norskog <[email protected]> wrote:

> Is the Apache public download bandwidth donated by Amazon? Or should we try
> to keep usage within AWS?
>
> On Thu, Oct 13, 2011 at 3:47 AM, Grant Ingersoll <[email protected]
> >wrote:
>
> >
> > On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
> >
> > > Grant,
> > >
> > > Can you share a few more details about the results? Do you get any
> > > exceptions, or do you just get no results?
> >
> > No results.
> >
> > >
> > > Using the NaNs inside the similarity matrix vectors has been included in
> > > the job for a very long time and should not cause any problems. As Sean
> > > already mentioned we have unit tests with toy data that should catch the
> > > very obvious errors in this code.
> >
> > Yeah, I don't know what happened.  I know I was getting results as little
> > as two weeks ago.  I will try rolling back to an earlier commit.
> >
> > >
> > > Can you share the dataset? I can do a testrun on my research cluster.
> >
> > I already have earlier in this thread.  There is a small set via the link
> > below or you can use the ASF email public dataset on Amazon or any subset
> > of it.
> >
> >
> > >
> > > --sebastian
> > >
> > > On 13.10.2011 08:37, Sean Owen wrote:
> > >> RecommenderJob? The unit tests run it all the time.
> > >> There should not be any glitches with static variables -- don't think
> > >> there are any.
> > >>
> > >> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <[email protected]>
> > wrote:
> > >>> Is this job working well for anyone now?
> > >>> When was the last time this job worked for someone?
> > >>>
> > >>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <
> [email protected]
> > >wrote:
> > >>>
> > >>>> Both local and on EC2
> > >>>>
> > >>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
> > >>>>
> > >>>>> Hi Grant,
> > >>>>>
> > >>>>> Just curious, are you running this locally or distributed?
> > >>>>>
> > >>>>> I'd run into a similar issue, though in a completely different
> > >>>>> algorithm (Jimmy Lin's PageRank implementation) due to the use of a
> > >>>>> static variable.
> > >>>>>
> > >>>>> When running locally, this wasn't getting cleared between loops, and
> > >>>>> thus I got wonky results.
> > >>>>>
> > >>>>> The same thing would have happened with JVM reuse enabled.
> > >>>>>
> > >>>>> -- Ken
> > >>>>>
> > >>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
> > >>>>>
> > >>>>>> Digging some more:
> > >>>>>>
> > >>>>>> In AggregateAndRecommend, around line 143, I have, for userId 0, a
> > >>>>>> simColumn of:
> > >>>>>>
> > >>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
> > >>>>>>
> > >>>>>> Which then becomes the numerator and the denom.
> > >>>>>>
> > >>>>>> Looping, my next simCol is:
> > >>>>>>
> > >>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
> > >>>>>>
> > >>>>>> and then
> > >>>>>>
> > >>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
> > >>>>>>
> > >>>>>> ...
> > >>>>>>
> > >>>>>> Each time, those are getting added into the numerators/denoms value,
> > >>>>>> such that by the time we are done looping (line 161), we have:
> > >>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> > >>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> > >>>>>>
> > >>>>>> numberOfSimilarItemsUsed: {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
> > >>>>>>
> > >>>>>> Not sure on how to interpret this as I haven't dug into the math here
> > >>>>>> yet or figured out where those NaN are coming from originally.
> > >>>>>>
> > >>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
> > >>>>>>
> > >>>>>>>
> > >>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
> > >>>>>>>>
> > >>>>>>>>> Where is the NaN coming up -- what has this value?
> > >>>>>>>>
> > >>>>>>>> simColumn seems to be the originator in the Aggregate step.  For
> > >>>>>>>> instance, my current breakpoint shows:
> > >>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
> > >>>>>>>>
> > >>>>>>>> I can also see some in the PartialMultiplyMapper via the
> > >>>>>>>> similarityMatrixColumn.
> > >>>>>>>>
> > >>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
> > >>>>>>>> <code>
> > >>>>>>>> /* remove self similarity */
> > >>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
> > >>>>>>>> </code>
> > >>>>>>>
> > >>>>>>> Ah, but that is just taking care of itself, so maybe not the issue.
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>> It should be propagated in some cases but not others. I'm not aware
> > >>>>>>>>> of any changes here.
> > >>>>>>>>
> > >>>>>>>> yeah, me neither.  This is all related to MAHOUT-798.
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Generally small data sets will have this problem of not being able to
> > >>>>>>>>> compute much of anything useful, so NaN might be right here.
> > >>>>>>>>> But you say it was different recently, which seems to rule that out.
> > >>>>>>>>
> > >>>>>>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
> > >>>>>>>> it's just that's a whole lot harder to debug.
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <
> > >>>> [email protected]> wrote:
> > >>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
> > >>>>>>>>>> getting any recommendations due to NaNs being calculated in the
> > >>>>>>>>>> AggregateAndRecommend step.  I'm not quite sure what is going on as it
> > >>>>>>>>>> seems like this was working as little as two weeks ago (post
> > >>>>>>>>>> Sebastian's big change to RecJob), but I don't see a whole lot of
> > >>>>>>>>>> changes in that part of the code.
> > >>>>>>>>>>
> > >>>>>>>>>> The data is user IDs mapping to email thread IDs.  My input data is
> > >>>>>>>>>> simply a triple of user id, thread id, 1 (meaning that user
> > >>>>>>>>>> participated in that thread).  It seems like I will have a lot of good
> > >>>>>>>>>> values in the inputs to the AggregateAndRecommend step, except one id
> > >>>>>>>>>> will be NaN and this then seems to get added in and makes everything
> > >>>>>>>>>> NaN (I realize this is a very naive understanding).  I sense that I
> > >>>>>>>>>> should be looking upstream in the process for a fix, but I am not sure
> > >>>>>>>>>> where that is.
> > >>>>>>>>>>
> > >>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs?  If you
> > >>>>>>>>>> want to try this with a small data set, you can get it here:
> > >>>>>>>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout
> > >>>>>>>>>> (but note the companion article is not published yet.)
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks,
> > >>>>>>>>>> Grant
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> --------------------------------------------
> > >>>>>>> Grant Ingersoll
> > >>>>>>> http://www.lucidimagination.com
> > >>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> > >>>>>>>
> > >>>>>>
> > >>>>>> --------------------------------------------
> > >>>>>> Grant Ingersoll
> > >>>>>> http://www.lucidimagination.com
> > >>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> > >>>>>>
> > >>>>>
> > >>>>> --------------------------
> > >>>>> Ken Krugler
> > >>>>> +1 530-210-6378
> > >>>>> http://bixolabs.com
> > >>>>> custom big data solutions & training
> > >>>>> Hadoop, Cascading, Mahout & Solr
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>> --------------------------------------------
> > >>>> Grant Ingersoll
> > >>>> http://www.lucidimagination.com
> > >>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Lance Norskog
> > >>> [email protected]
> > >>>
> > >
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com
> > Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >
> >
> >
> >
>
>
> --
> Lance Norskog
> [email protected]
>
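
For anyone reading the quoted thread later, the NaN propagation Grant describes (each simColumn carries a NaN at its own item index, and adding the columns together leaves every entry NaN) can be reproduced in a few lines. This is only a sketch in plain Java, not Mahout code: the class name, the `column` helper, and both summing methods are made up for illustration, and the preference weight is assumed to be 1, as in the input data described above. Skipping NaN entries is shown as one possible way to keep the sums usable; it is not a claim about how RecommenderJob does or should handle it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NanPropagationDemo {

    /** Element-wise sum of sparse columns; NaN + x is NaN, so one NaN poisons an index. */
    static Map<Long, Double> naiveSum(List<Map<Long, Double>> columns) {
        Map<Long, Double> sums = new LinkedHashMap<>();
        for (Map<Long, Double> column : columns) {
            for (Map.Entry<Long, Double> e : column.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return sums;
    }

    /** Same sum, but NaN entries are ignored instead of being added in. */
    static Map<Long, Double> nanSkippingSum(List<Map<Long, Double>> columns) {
        Map<Long, Double> sums = new LinkedHashMap<>();
        for (Map<Long, Double> column : columns) {
            for (Map.Entry<Long, Double> e : column.entrySet()) {
                if (!Double.isNaN(e.getValue())) {
                    sums.merge(e.getKey(), e.getValue(), Double::sum);
                }
            }
        }
        return sums;
    }

    /** Hypothetical helper: similarity column for item `self`, NaN at its own index. */
    static Map<Long, Double> column(long self, double sim, long... items) {
        Map<Long, Double> col = new LinkedHashMap<>();
        for (long item : items) {
            col.put(item, item == self ? Double.NaN : sim);
        }
        return col;
    }

    public static void main(String[] args) {
        // Item ids and similarity value copied from the simColumn dumps in the thread.
        long[] items = {22966L, 81901L, 263375L, 263374L, 263376L};
        List<Map<Long, Double>> columns = new ArrayList<>();
        for (long item : items) {
            columns.add(column(item, 0.9566912651062012, items));
        }
        System.out.println("naive:    " + naiveSum(columns));
        System.out.println("skip NaN: " + nanSkippingSum(columns));
    }
}
```

With five columns that each hold a single NaN at a different index, the naive sum ends up NaN everywhere, which matches the numerators/denoms dump in the thread. The NaN-skipping variant leaves each index with the sum of the four remaining similarities.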
