Hi,

Thank you for responding. My spam filter was "out to get me" and your responses
were misclassified.

I will investigate the Hadoop integration piece, specifically RecommenderJob. 
Currently, the Hadoop grid I'm working with is using 0.18.3. Will that pose a 
problem? I noticed some threads about versions of Hadoop less than 0.19 not 
working.
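
My (possibly wrong) reading of why older grids fail: the rewrite you mention
below targets Hadoop 0.20's new org.apache.hadoop.mapreduce API, which does
not exist in 0.18.3. A mapper against the new API looks roughly like this
(illustrative class, not Mahout's actual code):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // New-style mapper: extends the 0.20 Mapper class rather than
    // implementing the old org.apache.hadoop.mapred.Mapper interface
    // that 0.18.x provides.
    public class FirstFieldMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Illustrative: emit the first comma-separated field of each line
        String firstField = value.toString().split(",")[0];
        context.write(new Text(firstField), key);
      }
    }

If that reading is right, 0.18.3 presumably cannot load such a class at all.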

We are looking at starting with 70M users and scaling up to 500M eventually. It 
is hard for me to estimate the number of items. We could be starting out with 
100, but as these items are entities that we extract, there could be tens of 
thousands eventually. I would guess that most users would have fewer than 100 of
these.
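
To give a rough sense of why we think we need the grid -- a hand-wavy heap
estimate, assuming ~100 preferences per user and ~20 bytes per preference
(two 8-byte IDs plus a 4-byte value, ignoring object overhead):

    public class HeapEstimate {
      public static void main(String[] args) {
        long users = 70000000L;     // our starting point; 500M eventually
        long prefsPerUser = 100L;   // guess: most users have fewer than 100
        long bytesPerPref = 20L;    // assumption: 8 + 8 + 4 bytes, no overhead
        long gb = users * prefsPerUser * bytesPerPref / (1024L * 1024L * 1024L);
        System.out.println("~" + gb + " GB just for raw preferences");
        // ~130 GB at 70M users, ~930 GB at 500M -- well past one JVM heap
      }
    }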

Does that help? I would be interested in your input on the algorithms, and also
in being a guinea pig for the code you're developing, if that makes sense.

-Aurora


On 7/23/09 12:43 AM, "Sean Owen" <[email protected]> wrote:

Aurora did you see my last reply on the list?

On Wed, Jul 22, 2009 at 9:29 AM, Sean Owen <[email protected]> wrote:
> Yes, there are a few components here -- a few different purposes. All
> build around the core library, which isn't specific to Hadoop or an
> HTTP server, but you've seen some of the components that adapt the
> core to these contexts. There are also components that can evaluate
> or load-test the code.
>
> The only piece you are interested in, then, is really the Hadoop
> integration -- see org.apache.mahout.cf.taste.hadoop. There you will
> find RecommenderJob, which should be able to launch a
> pseudo-distributed recommender job. I say pseudo since these
> algorithms are not in general distributable, but one can of course
> run n instances of a recommender to compute 1/nth of all
> recommendations each. That is nice, though it means that, say, the
> amount of RAM each instance consumes is still limited by the size of
> each machine.
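>
> As a sketch of what I mean -- hypothetical code, not the actual job;
> the Recommender here is a stand-in for whatever implementation gets
> plugged in:
>
>   import java.util.List;
>
>   public class ShardedRecommendation {
>     interface Recommender {                  // stand-in for the real thing
>       List<Long> recommend(long userID, int howMany);
>     }
>
>     // Worker i of n handles only the users whose ID falls in its shard,
>     // so n instances together cover every user exactly once. Each worker
>     // still loads the full data model, hence the per-machine RAM limit.
>     static void runShard(long[] allUserIDs, Recommender rec,
>                          int numShards, int shardIndex) {
>       for (long userID : allUserIDs) {
>         if (userID % numShards == shardIndex) {
>           List<Long> recs = rec.recommend(userID, 10);
>           System.out.println(userID + "\t" + recs);
>         }
>       }
>     }
>   }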
>
> I just recently rewrote this package to be compatible with Hadoop
> 0.20's new APIs. I do not know that it works, and have some reason to
> believe there are bugs in the API that will prevent it from working.
> So this piece is currently in flux.
>
> If you want to experiment and be a guinea pig for this latest
> revision, I can provide close support to work through the bugs on
> both sides. Or we can talk a bit more about your requirements, to
> figure out whether this is feasible, what the best algorithm is, and
> whether you need Hadoop at all.
>
> How big is 'massive'? Could you reveal how many users, items, and
> user-item preferences you have, to an order of magnitude? What is
> generally the nature of your input data, and what do you want out as
> recommendations?
>
> On Wed, Jul 22, 2009 at 12:12 AM, Aurora
> Skarra-Gallagher<[email protected]> wrote:
>> Hi,
>>
>> I apologize if I've misunderstood the purpose of the Taste component of 
>> Mahout. Our goal was to take a recommendation framework and use our own 
>> recommendation algorithm within it. We need to process a massive amount of 
>> data, and wanted it to be done on our Hadoop grid. I thought that Taste was 
>> the right fit for the job. I'm not interested in the HTTP service. I'm 
>> interested in the recommendation framework, particularly from a back-end 
>> batch perspective. Does that help clarify? Thanks for helping me sort 
>> through this.
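>>
>> To make that concrete, here is roughly what I had pictured -- a purely
>> illustrative interface, not Mahout's actual API; the framework would own
>> the data loading and batch plumbing, and we would supply only the scorer:
>>
>>   // Purely illustrative, not real Mahout code.
>>   interface Scorer {
>>     double score(long userID, long itemID);
>>   }
>>
>>   class OurEntityScorer implements Scorer {
>>     public double score(long userID, long itemID) {
>>       return 0.0;  // our own entity-affinity computation would go here
>>     }
>>   }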
>>
>> -Aurora
>>
>>
>> On 7/21/09 3:02 PM, "Sean Owen" <[email protected]> wrote:
>>
>> Hmm, lots going on here; it's confusing.
>>
>> Are you trying to run this on Hadoop intentionally? Because the web
>> app example is not intended to run on Hadoop. It's a component
>> intended to serve recommendations over HTTP in real time. It also
>> appears you are running an evaluation rather than a web app serving
>> requests. I realize you're trying to run this without Jetty, but
>> that's kind of like trying to run a web app without a web server.
>>
>> I think you'd have to clarify what you are trying to do, and then what
>> you are doing right now, to begin to assist.
>>
>> On Tue, Jul 21, 2009 at 9:20 PM, Aurora
>> Skarra-Gallagher<[email protected]> wrote:
>>> Hi,
>>>
>>> I'm trying to run the Taste web example without using Jetty. Our gateways 
>>> aren't meant to be used as web servers. By poking around, I found that the 
>>> following command worked:
>>> hadoop --config ~/hod-clusters/test jar 
>>> /x/mahout-current/examples/target/mahout-examples-0.2-SNAPSHOT.job 
>>> org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner
>>>
>>> The output is:
>>> 09/07/21 19:59:21 INFO file.FileDataModel: Creating FileDataModel for file 
>>> /tmp/ratings.txt
>>> 09/07/21 19:59:21 INFO eval.AbstractDifferenceRecommenderEvaluator: 
>>> Beginning evaluation using 0.9 of GroupLensDataModel
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Reading file info...
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 100000 lines
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 200000 lines
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 300000 lines
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 400000 lines
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 500000 lines
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 600000 lines
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 700000 lines
>>> 09/07/21 19:59:22 INFO file.FileDataModel: Processed 800000 lines
>>> 09/07/21 19:59:23 INFO file.FileDataModel: Processed 900000 lines
>>> 09/07/21 19:59:23 INFO file.FileDataModel: Processed 1000000 lines
>>> 09/07/21 19:59:23 INFO file.FileDataModel: Read lines: 1000209
>>> 09/07/21 19:59:30 INFO slopeone.MemoryDiffStorage: Building average diffs...
>>> 09/07/21 19:59:42 INFO eval.AbstractDifferenceRecommenderEvaluator: 
>>> Evaluation result: 0.7035965559003973
>>> 09/07/21 19:59:42 INFO grouplens.GroupLensRecommenderEvaluatorRunner: 
>>> 0.7035965559003973
>>>
>>> The job appears to write data to /tmp/ratings.txt and /tmp/movies.txt. I'm 
>>> not sure if this is the correct way to run this example. I have a few 
>>> questions:
>>>
>>>  1.  Is the output file /tmp/ratings.txt? If so, how do I interpret it?
>>>  2.  What does the Evaluation result mean?
>>>  3.  Is it even running on HDFS?
>>>  4.  Is it a map-reduce job?
>>>
>>> Any pointers on how to run this as a standalone job would be helpful.
>>>
>>> Thanks,
>>> Aurora
>>>
>>
>>
>
