date:20100222

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-22 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836543#action_12836543
]

Ankur commented on MAHOUT-305:
--

Sean, thanks for filing the JIRA. Noting points from our discussion here.

1. We need to decide on the dataset to run both implementations on. I have
the Netflix dataset in mind, but a strange thing I observed during my tests with
it is that there were 2 - 3 users who rated more than 10,000 movies! This seemed
a little odd to me. Can you or someone else who has experience with the dataset
validate my observation?

2. Both implementations need to run on the same dataset in an identical
environment to gauge performance and accuracy. For accuracy I believe we need
to do a Precision-Recall test. My understanding of it is:

a) Do an 80-20 split of the data (80% train and 20% test), with the split
happening on a timeline.
b) Feed the training data to the algorithm and generate recommendations for a
subset of users from the training data.
c) Compare those recommendations with the items actually present in each
user's history in the test data.
d) Calculate precision = tp / (tp + fp) = (recommendations actually
present in user's history) / (total items recommended)
e) Calculate recall = tp / (tp + fn) = (recommendations actually
present in user's history) / (total items in user's history)
f) Finally, take a simple average of both across all the users to get an
approximate global precision/recall.

Please feel free to correct any of the steps above if I misunderstood anything.
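
Steps a) to f) above can be sketched roughly as follows. This is a minimal Python illustration, not Mahout code: the function names (`split_by_time`, `items_by_user`, `precision_recall`, `average_precision_recall`), the `(user, item, timestamp)` tuple layout, and the `recommend` callback are all assumptions made for the example.

```python
from collections import defaultdict


def split_by_time(events, train_fraction=0.8):
    """Step a: order (user, item, timestamp) events by time and cut at 80%."""
    ordered = sorted(events, key=lambda e: e[2])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def items_by_user(events):
    """Group the items each user interacted with."""
    history = defaultdict(set)
    for user, item, _ in events:
        history[user].add(item)
    return history


def precision_recall(recommended, relevant):
    """Steps d and e for one user.

    tp = recommended items that appear in the user's test history.
    """
    recommended, relevant = set(recommended), set(relevant)
    tp = len(recommended & relevant)
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision_recall(recommend, train, test):
    """Steps b, c and f: simple average over users present in both splits.

    `recommend(user, train_history)` is a hypothetical callback standing in
    for whichever recommender implementation is being evaluated.
    """
    train_history = items_by_user(train)
    test_history = items_by_user(test)
    users = [u for u in test_history if u in train_history]
    if not users:
        return 0.0, 0.0
    total_p = total_r = 0.0
    for user in users:
        recs = recommend(user, train_history)
        p, r = precision_recall(recs, test_history[user])
        total_p += p
        total_r += r
    return total_p / len(users), total_r / len(users)
```

Running both implementations through the same `average_precision_recall` call, on the same split, would give directly comparable numbers. Averaging per user (rather than pooling all tp/fp/fn counts) weights every user equally, which matters if a few users have very long histories.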

Combine both cooccurrence-based CF M/R jobs
---

Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Priority: Minor

We have two different but essentially identical MapReduce jobs to make
recommendations based on item co-occurrence:
org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be
merged. Not sure exactly how to approach that but noting this in JIRA, per
Ankur.


Re: Algorithm implementations in Pig

2010-02-22 Thread Jeff Zhang

Hi,

Glad to hear that Mahout devs are interested in Pig. I believe Pig is very
helpful when you want to quickly implement a prototype of a machine learning
algorithm. Pig also has a Java API, so it is easy to integrate Pig scripts
with Java.  Maybe we can start by implementing NB in Pig first.



On Mon, Feb 22, 2010 at 3:56 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 I have had both positive and negative results with PIG.

 The positive results were that I was able to express large recommendation
 computations in a very concise way.  That was really helpful.

 My negative results have been to do with the brittle nature of PIG vis a vis
 the version of the underlying hadoop system.  That problem may have abated
 somewhat as everybody in the world except me and Amazon's EMR has pretty
 much piled up on version 20.  I also know little about how Pig would
 interface well with other components.  I know that I have had difficulty in
 the past injecting outside information into Pig, but that has been
 improved.  I also know that Pigs eat anything, but have no clear idea how
 well this would play out with, say, our vector formats and vectorizers.

 Ankur, what recent experience do you have?  How well do PIG scripts play
 with other programs any more?

 On Sun, Feb 21, 2010 at 11:41 PM, Ankur C. Goel gan...@yahoo-inc.com
 wrote:

  I had Sean's opinion on this and he was not too comfortable with the idea
  of having things in different languages in Mahout. However, given the
  benefits of PIG, I feel otherwise. I may be biased here due to my own
  experience of being able to do more in less time in Pig than in M/R, so I
  thought let me ask how folks feel.
 
  Ted, I believe you have some PIG experience yourself so any thoughts on
  this ?
 



 --
 Ted Dunning, CTO
 DeepDyve




-- 
Best Regards

Jeff Zhang

Re: Algorithm implementations in Pig

2010-02-22 Thread Ted Dunning

I see pig as useful for data preparation, but for any numerical tasks, it is
likely to be completely hopeless.

On Mon, Feb 22, 2010 at 12:16 AM, Jeff Zhang zjf...@gmail.com wrote:


 Glad to hear here that mahout devs are interested in pig. Actually I believe
 pig is very helpful when you want to quickly implement a prototype of
 machine learning algorithms. And Pig has java API, it is easy to integrate
 pig script with java.  Maybe we can start with implementing NB using pig
 first.




-- 
Ted Dunning, CTO
DeepDyve

Re: Algorithm implementations in Pig

2010-02-22 Thread Jeff Zhang

Pig only makes the MapReduce implementation easier; the numerical
computation can be done in UDFs. Also, piglet is a DSL on top of Pig Latin
that adds loop support to Pig.
http://github.com/iconara/piglet



On Mon, Feb 22, 2010 at 4:25 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 I see pig as useful for data preparation, but for any numerical tasks, it is
 likely to be completely hopeless.

 On Mon, Feb 22, 2010 at 12:16 AM, Jeff Zhang zjf...@gmail.com wrote:

 
  Glad to hear here that mahout devs are interested in pig. Actually I believe
  pig is very helpful when you want to quickly implement a prototype of
  machine learning algorithms. And Pig has java API, it is easy to integrate
  pig script with java.  Maybe we can start with implementing NB using pig
  first.




 --
 Ted Dunning, CTO
 DeepDyve




-- 
Best Regards

Jeff Zhang

Re: Algorithm implementations in Pig

2010-02-22 Thread Robin Anil

On Mon, Feb 22, 2010 at 1:55 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 I see pig as useful for data preparation, but for any numerical tasks, it is
 likely to be completely hopeless.


PIG will be a great tool to experiment quickly on algorithms. But with
people here trying to focus on using Vector to standardize the input/output
process, it will be tough for the small bunch here to port that to PIG, or
help PIG scripts reuse it. As long as the input/output of PIG-based
algorithms is based on VectorWritable, I don't see any problem with including
PIG. But bear in mind that the previous PIG submission,
https://issues.apache.org/jira/browse/MAHOUT-106, still hasn't moved into
trunk. If anyone is willing to help standardize on using PIG with
vectors as input, they are more than welcome.

One thing we definitely don't want at this point is for every algorithm
to have a different kind of input format.

Robin

Re: Algorithm implementations in Pig

2010-02-22 Thread Ankur C. Goel

Ted,
 The latest Pig release, 0.6.0 on Hadoop 20, is a clear winner, not just for
performance but also for doing a better job of managing memory in its MR job
pipeline. Support for both inner and outer skewed joins is also something that
I found indispensable when dealing with really large datasets. There is support
for streaming in Pig that lets you stream your relation through an external
Perl/Python/Ruby... script. Support for UDFs in scripting languages is also
expected in the near future.

About interfacing with other systems, I assume you have an RDBMS in mind. There
is a patch (for Pig 0.7) that lets you write directly from Pig to an RDBMS like
MySQL. Support for writing directly to HBase was always there and has been
improved, I believe. With the 0.7 release, Pig has decided to let its load/store
functions rely on Hadoop's input/output formats, so our vector format shouldn't
be a problem IMHO. The only thing I am concerned about is the not-too-efficient
Tuple implementation in Pig, which does not give performance equivalent to
Java MR.

Recently I implemented shingling in Pig and found it to work beautifully. One
problem that I hit had to do with using clusters to generate recommendations,
since some clusters were quite large ( 10 K). For this I needed to do a
self-join and wanted the join load to be split evenly. That's where skewed join
came to the rescue.
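
To illustrate why the join load is uneven in this situation: a self-join on the cluster key emits on the order of n squared rows for a cluster with n members, so a single large cluster can dominate the work if each key goes to one reducer. The rough Python sketch below uses made-up cluster sizes; the function name and the sizes are assumptions for illustration, not figures from my data, and it does not show how Pig implements skewed join.

```python
def self_join_pair_counts(cluster_sizes):
    """Number of unordered member pairs a self-join yields per cluster."""
    return {cid: n * (n - 1) // 2 for cid, n in cluster_sizes.items()}


# Hypothetical cluster sizes: two small clusters and one very large one.
sizes = {"c1": 10, "c2": 200, "c3": 10_000}
pairs = self_join_pair_counts(sizes)
total = sum(pairs.values())

for cid, count in pairs.items():
    share = count / total
    print(f"{cid}: {count} pairs ({share:.2%} of the join output)")
```

With sizes like these, almost all of the join output comes from the one large cluster, which is the case where spreading a single key's rows across several reducers helps.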

Apart from this I also want to contribute my implementation to Mahout (the 
reason for starting this thread :-))

-...@nkur

On 2/22/10 1:26 PM, Ted Dunning ted.dunn...@gmail.com wrote:

I have had both positive and negative results with PIG.

The positive results were that I was able to express large recommendation
computations in a very concise way.  That was really helpful.

My negative results have been to do with the brittle nature of PIG vis a vis
the version of the underlying hadoop system.  That problem may have abated
somewhat as everybody in the world except me and Amazon's EMR has pretty
much piled up on version 20.  I also know little about how Pig would
interface well with other components.  I know that I have had difficulty in
the past injecting outside information into Pig, but that has been
improved.  I also know that Pigs eat anything, but have no clear idea how
well this would play out with, say, our vector formats and vectorizers.

Ankur, what recent experience do you have?  How well do PIG scripts play
with other programs any more?

On Sun, Feb 21, 2010 at 11:41 PM, Ankur C. Goel gan...@yahoo-inc.com wrote:

 I had Sean's opinion on this and he was not too comfortable with the idea
 of having things in different languages in Mahout. However, given the
 benefits of PIG, I feel otherwise. I may be biased here due to my own
 experience of being able to do more in less time in Pig than in M/R, so I
 thought let me ask how folks feel.

 Ted, I believe you have some PIG experience yourself so any thoughts on
 this ?




--
Ted Dunning, CTO
DeepDyve

Re: Algorithm implementations in Pig

2010-02-22 Thread Grant Ingersoll

I'm all for Pig, especially once we are a TLP.  I haven't had the proper time
to review the PLSI implementation, but it looks useful.  I agree on the other
points, though, in that I think it would be nice to have consistent formats
based on Vector so that things can be more portable.


On Feb 22, 2010, at 2:41 AM, Ankur C. Goel wrote:

 Hi Folks,
   I would like to know how the Mahout community feels about having
 some of the Mahout algorithms implemented in Pig -
 http://hadoop.apache.org/pig. The benefits of using Pig are many, including:
 
 1.  Small learning curve; people with a bit of SQL knowledge will find it
 very easy.
 2.  Operations like grouping, aggregation, and joins need just a few lines of
 Pig code.
 3.  Insulation against Hadoop complexity - job chains and JobConf.
 4.  Quick prototyping and hence increased programmer productivity.
 
 I had Sean's opinion on this and he was not too comfortable with the idea of
 having things in different languages in Mahout. However, given the benefits
 of PIG, I feel otherwise. I may be biased here due to my own experience of
 being able to do more in less time in Pig than in M/R, so I thought let me
 ask how folks feel.
 
 Ted, I believe you have some PIG experience yourself so any thoughts on this ?
 
 Regards
 -...@nkur
