cd mahout/examples/bin
./build-asf-email.sh content/ out/ over/
select 1 for recommender

where content/ is
content/cocoon.apache.org
content/commons.apache.org

and out/ and over/ are output directories. Run the shell script with -x as
you will probably have to tweak it.

Lance

On Thu, Oct 13, 2011 at 11:04 PM, Sebastian Schelter <s...@apache.org> wrote:

> Only got the raw data, how did you convert it to our standard
> recommender input?
>
> --sebastian
>
>
> On 14.10.2011 01:17, Grant Ingersoll wrote:
> > Were you able to get the data, Sebastian?
> >
> > On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
> >
> >> Grant,
> >>
> >> Can you share a few more details about the results? Do you get any
> >> exceptions? Or do you just get no results?
> >>
> >> Using the NaNs inside the similarity matrix vectors has been included in
> >> the job for a very long time and should not cause any problems. As Sean
> >> already mentioned we have unit tests with toy data that should catch the
> >> very obvious errors in this code.
> >>
> >> Can you share the dataset? I can do a testrun on my research cluster.
> >>
> >> --sebastian
> >>
> >> On 13.10.2011 08:37, Sean Owen wrote:
> >>> RecommenderJob? The unit tests run it all the time.
> >>> There should not be any glitches with static variables -- don't think
> >>> there are any.
> >>>
> >>> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <goks...@gmail.com> wrote:
> >>>> Is this job working well for anyone now?
> >>>> When was the last time this job worked for someone?
> >>>>
> >>>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> >>>>
> >>>>> Both local and on EC2
> >>>>>
> >>>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
> >>>>>
> >>>>>> Hi Grant,
> >>>>>>
> >>>>>> Just curious, are you running this locally or distributed?
> >>>>>>
> >>>>>> I'd run into a similar issue, though in a completely different algorithm
> >>>>>> (Jimmy Lin's PageRank implementation) due to the use of a static variable.
> >>>>>>
> >>>>>> When running locally, this wasn't getting cleared between loops, and thus
> >>>>>> I got wonky results.
> >>>>>>
> >>>>>> The same thing would have happened with JVM reuse enabled.
> >>>>>>
> >>>>>> -- Ken
> >>>>>>
> >>>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
> >>>>>>
> >>>>>>> Digging some more:
> >>>>>>>
> >>>>>>> In AggregateAndRecommend, around line 143, I have, for userId 0, a
> >>>>>>> simColumn of:
> >>>>>>>
> >>>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
> >>>>>>>
> >>>>>>> Which then becomes the numerator and the denom.
> >>>>>>>
> >>>>>>> Looping, my next simCol is:
> >>>>>>>
> >>>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
> >>>>>>>
> >>>>>>> and then
> >>>>>>>
> >>>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
> >>>>>>>
> >>>>>>> ...
> >>>>>>>
> >>>>>>> Each time, those are getting added into the numerators/denoms value,
> >>>>>>> such that by the time we are done looping (line 161), we have:
> >>>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >>>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >>>>>>>
> >>>>>>> numberOfSimilarItemsUsed: {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
> >>>>>>>
> >>>>>>> Not sure how to interpret this, as I haven't dug into the math here
> >>>>>>> yet or figured out where those NaNs are coming from originally.
> >>>>>>>
> >>>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
> >>>>>>>>>
> >>>>>>>>>> Where is the NaN coming up -- what has this value?
> >>>>>>>>>
> >>>>>>>>> simColumn seems to be the originator in the Aggregate step. For
> >>>>>>>>> instance, my current breakpoint shows:
> >>>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
> >>>>>>>>>
> >>>>>>>>> I can also see some in the PartialMultiplyMapper via the
> >>>>>>>>> similarityMatrixColumn.
> >>>>>>>>>
> >>>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
> >>>>>>>>> <code>
> >>>>>>>>> /* remove self similarity */
> >>>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
> >>>>>>>>> </code>
> >>>>>>>>
> >>>>>>>> Ah, but that is just taking care of itself, so maybe not the issue.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> It should be propagated in some cases but not others. I'm not aware of
> >>>>>>>>>> any changes here.
> >>>>>>>>>
> >>>>>>>>> yeah, me neither.  This is all related to MAHOUT-798.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Generally small data sets will have this problem of not being able to
> >>>>>>>>>> compute much of anything useful, so NaN might be right here.
> >>>>>>>>>> But you say it was different recently, which seems to rule that out.
> >>>>>>>>>
> >>>>>>>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
> >>>>>>>>> it's just that's a whole lot harder to debug.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> >>>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
> >>>>>>>>>>> getting any recommendations due to NaNs being calculated in the
> >>>>>>>>>>> AggregateAndRecommend step. I'm not quite sure what is going on, as it
> >>>>>>>>>>> seems like this was working as little as two weeks ago (post Sebastian's
> >>>>>>>>>>> big change to RecJob), but I don't see a whole lot of changes in that
> >>>>>>>>>>> part of the code.
> >>>>>>>>>>>
> >>>>>>>>>>> The data is user ids mapping to email thread ids. My input data is
> >>>>>>>>>>> simply a triple of user id, thread id, 1 (meaning that user participated
> >>>>>>>>>>> in that thread). It seems like I will have a lot of good values in the
> >>>>>>>>>>> inputs to the AggregateAndRecommend step, except one id will be NaN, and
> >>>>>>>>>>> this then seems to get added in and makes everything NaN (I realize this
> >>>>>>>>>>> is a very naive understanding). I sense that I should be looking upstream
> >>>>>>>>>>> in the process for a fix, but I am not sure where that is.
> >>>>>>>>>>>
> >>>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs? If you
> >>>>>>>>>>> want to try this with a small data set, you can get it here:
> >>>>>>>>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout
> >>>>>>>>>>> (but note the companion article is not published yet.)
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Grant
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> --------------------------------------------
> >>>>>>>> Grant Ingersoll
> >>>>>>>> http://www.lucidimagination.com
> >>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>>>>>>
> >>>>>>
> >>>>>> --------------------------
> >>>>>> Ken Krugler
> >>>>>> +1 530-210-6378
> >>>>>> http://bixolabs.com
> >>>>>> custom big data solutions & training
> >>>>>> Hadoop, Cascading, Mahout & Solr
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Lance Norskog
> >>>> goks...@gmail.com
> >>>>
> >>
> >
>
>
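
The NaN spread Grant describes in the quoted thread can be reproduced in isolation. The sketch below is not Mahout code: the class and method names are made up for illustration. It only shows that a single NaN term poisons a running sum, which matches every numerator and denominator ending up NaN once each item has been NaN in one simColumn. The NaN-skipping variant is one possible guard, assumed here for comparison; the thread does not establish that this is the right fix, or even that these NaNs are the real cause.

```java
public class NaNSumSketch {

    // Plain running sum: under IEEE 754, NaN + x is NaN, so one NaN term
    // makes the whole total NaN.
    public static double naiveSum(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    // Hypothetical guard: leave NaN terms out of the total.
    public static double sumIgnoringNaN(double[] values) {
        double total = 0.0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                total += v;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Contributions to one item (e.g. 263376) across five simColumns,
        // modelled on the values in the thread: four similarities, one NaN.
        double s = 0.9566912651062012;
        double[] contributions = {Double.NaN, s, s, s, s};

        System.out.println("naive sum:        " + naiveSum(contributions));
        System.out.println("NaN-skipping sum: " + sumIgnoringNaN(contributions));
    }
}
```

The naive sum prints NaN no matter where the NaN sits in the array, which is why the position of the NaN in each simColumn does not matter once the columns are added together.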


-- 
Lance Norskog
goks...@gmail.com
