Re: Recommender for news articles based on own user profile (URL history)

2014-02-16 Thread Pat Ferrel
In the simple case I’m not sure a collaborative filtering recommender is going 
to work here. The items change too quickly to gather significant preference 
data. Articles are your items; what is their lifetime? To do CF you need 
relatively long-lived items and enough user preference data about those items. 

There are other ways to tackle this. Let's take Google Alerts as an example. 
They start with search text. I created one with the text “machine learning” and 
got some silly alerts: 
http://occamsmachete.com/ml/2012/03/16/fun-with-google-alerts/

But what they do is track every time you follow a link from their recs email. 
Then they train a classifier with all of the text you read. The start is pretty 
awful but they get better very quickly. I’m sure they do some things to make 
this more scalable but that’s a longer story. There is a CF angle with enough 
technology (read on).

Can you do the same thing? If you can tell what articles people read you can 
use this collection as a content exemplar and recommend new news items based on 
similarity to this collection.

To use the GA template:
1) use Solr to recommend articles from a user’s tweets (they may be awful at 
first)
2) track what they read and keep it as an example of the type of thing they like
3) when new articles come in, find the people who like that sort of thing and 
make them aware of it. You do this by comparing the new article with each 
user's collection of past reads. You can do this with Solr for ease and 
simplicity, but batch classification will probably give better results.
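
If it helps, step 1 is only a few lines with SolrJ. A minimal sketch, assuming a 
Solr core called "news" with the article text in a "body" field (both names are 
made up) and the Solr 4.x-era SolrJ API:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class TweetDrivenRecs {
  public static void main(String[] args) throws Exception {
    // assumes a Solr core named "news" with article text indexed in a "body" field
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/news");

    // a user's recent tweet text (already scrubbed of links, @mentions, etc.)
    // used as a plain free-text query against the indexed articles
    String tweetText = "spark mllib benchmark machine learning";
    SolrQuery query = new SolrQuery("body:(" + tweetText + ")");
    query.setRows(10);

    QueryResponse response = solr.query(query);
    for (SolrDocument doc : response.getResults()) {
      System.out.println(doc.getFieldValue("id") + "  " + doc.getFieldValue("title"));
    }
    solr.shutdown();
  }
}

Real tweets would need escaping and scrubbing before being dropped into a query 
string, but that's the whole trial loop.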

Some have used named entities in news and tweets to make CF-based recs. If you 
knew one named entity in an article was 'Putin' you could treat it as an item 
and gather CF data from people who read about him. With enough history like 
that you could build a CF-type recommender. It wouldn't surprise me if Google 
is doing something with this in a lot of their search products, like Alerts.
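
To make that concrete, here's a minimal in-memory sketch with Mahout's Taste 
API, assuming you've already boiled your logs down to (userID, entityID) pairs 
in a CSV file; the file name and IDs are placeholders:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class EntityRecs {
  public static void main(String[] args) throws Exception {
    // user_entity.csv: userID,entityID -- boolean "read about this entity" data, no ratings
    DataModel model = new FileDataModel(new File("user_entity.csv"));
    ItemSimilarity similarity = new TanimotoCoefficientSimilarity(model);
    ItemBasedRecommender recommender =
        new GenericBooleanPrefItemBasedRecommender(model, similarity);

    // 10 entities this user hasn't read about yet but similar readers have
    List<RecommendedItem> recs = recommender.recommend(1L, 10);
    for (RecommendedItem item : recs) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}

The point of recommending entities rather than article IDs is that entities 
live long enough to accumulate preference data, which is exactly what plain CF 
on short-lived articles lacks.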
 
On Feb 16, 2014, at 11:51 AM, Juanjo Ramos wrote:





Alternative input formats for distributed recommenders

2014-02-16 Thread Jay Vyas
Does Mahout have any kind of record transformer or reader API so that I can
use existing files that aren't perfectly formatted as input to the
recommenders?

The recommender's desired input data set has this format:

jay, skis, .2
jay, iphone, .3


Instead I have:

jay,  xbffX, skis, .2
jay,   x123x, iphone, .3

So I'd like to tell the recommender engine at runtime to read in fields 0,
2, and 3, skipping the garbage text in column 1.

Any ideas on how to handle this without having to write a MapReduce job
just to scrape 3 of the 4 columns out of the file?
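
To be concrete, all I need is to drop column 1, i.e. something like this
throwaway rewrite (a sketch, file names made up), which is exactly the kind of
extra step I'd like to avoid maintaining:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;

public class StripColumn {
  public static void main(String[] args) throws Exception {
    // rewrite "user, garbage, item, pref" as "user, item, pref"
    BufferedReader in = new BufferedReader(new FileReader("prefs_raw.csv"));
    PrintWriter out = new PrintWriter(new FileWriter("prefs.csv"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] f = line.split(",");
      if (f.length < 4) {
        continue;                            // skip malformed lines
      }
      out.println(f[0].trim() + "," + f[2].trim() + "," + f[3].trim());
    }
    in.close();
    out.close();
  }
}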

-- 
Jay Vyas
http://jayunit100.blogspot.com


Re: Recommender for news articles based on own user profile (URL history)

2014-02-16 Thread Juanjo Ramos
As per your question, we have not built anything yet, so we are dealing with 
exactly that problem: how to let the tweets drive the recommendation of the 
news to be viewed.

The original idea was to find item-item similarity between the user tweets and 
the news in order to deal with the cold-start problem and to infer some 
initial preferences of the users for the news from that similarity. This is 
where my idea of using RowSimilarityJob to compute the matrix of similarities 
came from. Later, as users access different news items, those preferences will 
be tuned as in a regular item-based recommender.
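
For reference, this is roughly the invocation I had in mind, as a sketch; the
paths and column count are placeholders and I am not yet sure all the options
are right:

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;

public class ItemItemSimilarities {
  public static void main(String[] args) throws Exception {
    // rows = items (tweets + news) as tf-idf vectors produced by seq2sparse
    ToolRunner.run(new RowSimilarityJob(), new String[] {
        "--input", "/tmp/vectors/tfidf-vectors",
        "--output", "/tmp/item-item-similarity",
        "--similarityClassname", "SIMILARITY_COSINE",
        "--numberOfColumns", "50000",          // dimensionality of the tf-idf vectors
        "--maxSimilaritiesPerRow", "50",
        "--excludeSelfSimilarity", "true"
    });
  }
}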

Since the system has not been built yet, our first goal is to design the 
architecture of the system and how it should respond when new tweets are 
produced, even if the performance is not the best in this first version. Then 
we will focus on the particular problem of using tweets to recommend news, for 
which the links you posted will be extremely helpful.

I am new to Mahout. I have just finished reading 'Mahout in Action', and that 
is why I tried to use only Mahout for the implementation, but the approach you 
suggest with Solr seems more reasonable for having the system respond and 
adapt quickly when new tweets are produced.

Thanks again.



Re: Recommender for news articles based on own user profile (URL history)

2014-02-16 Thread Pat Ferrel
The solution you mention doesn’t sound right. You would usually not need to 
create a new ItemSimilarity class unless you have a new way to measure 
similarity.

Let's see if I have this right:

1) you want to recommend news
2) recs are based on a user’s tweets
3) you have little metadata about either input or recommended items

You mention that you have previous tweets? Do you know which tweets led to 
which news being viewed? Are you collecting links in tweets? You can augment 
tweet text with text from the pages linked to.
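
Mechanically that's simple; for instance with jsoup, something like this sketch 
pulls in the text of each page a tweet links to (the class and method names are 
just illustrative):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;

public class TweetAugmenter {
  private static final Pattern URL = Pattern.compile("https?://\\S+");

  // returns the tweet text plus the visible text of every page it links to
  public static String augment(String tweet) {
    StringBuilder sb = new StringBuilder(tweet);
    Matcher m = URL.matcher(tweet);
    while (m.find()) {
      try {
        sb.append(' ').append(Jsoup.connect(m.group()).get().text());
      } catch (Exception e) {
        // dead or slow link: just keep the raw tweet text
      }
    }
    return sb.toString();
  }
}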

There are many difficulties in using tweets to recommend news, so I'd do some 
research before you start. A quick search turned up this article, 
http://nlp.cs.rpi.edu/paper/tweetnews.pdf, which references others.

Also, Ken Krugler wrote a series of articles on techniques used to improve 
text-to-text similarity; make sure to read both parts. 
http://www.scaleunlimited.com/2013/07/10/text-feature-selection-for-machine-learning-part-1/

I can't predict where this will end up, but an easy thing to do as a trial is 
to index the news in Solr and use scrubbed tweets as queries. You could 
probably set this up in an hour or so and try it with your own tweets to see 
how well it does. I suspect this won't be your ultimate solution, but it's easy 
to do while you get your mind around the research.
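
By "scrubbed" I just mean stripping the Twitter noise so what's left reads like 
plain query text; a sketch:

public class TweetScrubber {
  // strip the Twitter-specific noise so what's left looks like plain query text
  public static String scrub(String tweet) {
    return tweet
        .replaceAll("https?://\\S+", " ")      // links
        .replaceAll("[@#]\\w+", " ")           // mentions and hashtags
        .replaceAll("\\bRT\\b", " ")           // retweet marker
        .replaceAll("[^\\p{L}\\p{Nd} ]", " ")  // punctuation, emoji, etc.
        .replaceAll("\\s+", " ")
        .trim();
  }
}

The scrubbed text can then go straight into the query against the indexed news.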

On Feb 16, 2014, at 5:54 AM, Juanjo Ramos wrote:





Re: Recommender for news articles based on own user profile (URL history)

2014-02-16 Thread Juanjo Ramos
Hi Pat,
Thanks so much for your detailed response.

At the moment we do not have any metadata about the articles, just their title 
and body. In addition, the dataset contains tweets from the users which will 
never be in the output of the recommender (we never want to recommend that a 
user view a particular tweet), but we will use them to tune the users' 
preferences for different pieces of news, based on the similarity between the 
tweets they have produced and the news that we have.

Would the approach you suggest with Solr still be valid in this particular 
scenario? We would need the user preferences to be updated as soon as a user 
produces a new tweet, hence my urge to recompute item similarities as soon as a 
new tweet is produced. As you mentioned, we do not need to recompute the matrix 
of similarities whenever a piece of news is produced.

I do not know if the approach I am about to suggest even makes sense, but my 
idea was to precompute the similarities between items (news + tweets) and store 
them along with the vectorized representation of every item. Then I would 
implement my own ItemSimilarity class which would return the similarity for 
every pair of items (from the matrix if available) or calculate it on the fly 
if not found. My main problem here is that I do not know how to calculate in 
Mahout the cosine distance between the vectorized representations of 2 
particular items. Does this approach make sense in the first place?
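
Roughly what I have in mind, as a sketch; the cosine fallback is just my guess
at how to do it with Mahout math Vectors, and I am not sure it is the idiomatic
way:

import java.util.Collection;
import java.util.Map;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import org.apache.mahout.math.Vector;

public class PrecomputedItemSimilarity implements ItemSimilarity {

  private final Map<Long, Map<Long, Double>> precomputed; // from RowSimilarityJob output
  private final Map<Long, Vector> itemVectors;            // tf-idf vector per item

  public PrecomputedItemSimilarity(Map<Long, Map<Long, Double>> precomputed,
                                   Map<Long, Vector> itemVectors) {
    this.precomputed = precomputed;
    this.itemVectors = itemVectors;
  }

  @Override
  public double itemSimilarity(long itemID1, long itemID2) {
    // use the precomputed matrix if we have this pair (symmetric lookup omitted for brevity)
    Map<Long, Double> row = precomputed.get(itemID1);
    if (row != null && row.containsKey(itemID2)) {
      return row.get(itemID2);
    }
    // fall back to cosine between the stored tf-idf vectors
    Vector v1 = itemVectors.get(itemID1);
    Vector v2 = itemVectors.get(itemID2);
    if (v1 == null || v2 == null) {
      return Double.NaN;
    }
    double denom = v1.norm(2) * v2.norm(2);
    return denom == 0.0 ? 0.0 : v1.dot(v2) / denom;
  }

  @Override
  public double[] itemSimilarities(long itemID1, long[] itemID2s) {
    double[] result = new double[itemID2s.length];
    for (int i = 0; i < itemID2s.length; i++) {
      result[i] = itemSimilarity(itemID1, itemID2s[i]);
    }
    return result;
  }

  @Override
  public long[] allSimilarItemIDs(long itemID) {
    return new long[0]; // not needed for this sketch
  }

  @Override
  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    // nothing to refresh in this sketch
  }
}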

Many thanks.