I don't think a MapFile is a good solution as the file would have to be
accessed for every Reducer invocation to load the filter items for that
user. Correct me if I'm wrong.
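If keeping the filter in memory is acceptable, the pairs could instead be read once in the Reducer's setup() into a plain in-memory map, so no file access happens per invocation. A minimal sketch of that lookup structure in plain Java (the class and method names here are hypothetical, not existing Mahout API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * In-memory user -> banned-item filter, as a reducer could populate it
 * once in setup() from the filter file on HDFS. Hypothetical sketch,
 * not Mahout code.
 */
class BannedItemFilter {

  private final Map<Long, Set<Long>> bannedItemsByUser = new HashMap<Long, Set<Long>>();

  /** Record that itemID must never be recommended to userID. */
  public void ban(long userID, long itemID) {
    Set<Long> items = bannedItemsByUser.get(userID);
    if (items == null) {
      items = new HashSet<Long>();
      bannedItemsByUser.put(userID, items);
    }
    items.add(itemID);
  }

  /** Check a candidate recommendation before emitting it. */
  public boolean isBanned(long userID, long itemID) {
    Set<Long> items = bannedItemsByUser.get(userID);
    return items != null && items.contains(itemID);
  }
}
```

The reducer would then call isBanned() for each candidate while selecting the top-K, which is simple but bounded by the heap, as noted for option 1.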
--sebastian
On 24.08.2010 15:45, han henry wrote:
> For 1), a user's invalid items can be stored in multiple files; we use
> MapFilesMap to load the data from HDFS,
> then we can check the invalid items.
>
> package org.apache.mahout.cf.taste.hadoop;
>
> import java.io.Closeable;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.PathFilter;
> import org.apache.hadoop.io.MapFile;
> import org.apache.hadoop.io.Writable;
> import org.apache.hadoop.io.WritableComparable;
> import org.slf4j.Logger;
> import org.slf4j.LoggerFactory;
>
> public final class MapFilesMap<K extends WritableComparable, V extends Writable>
>     implements Closeable {
>
>   private static final Logger log = LoggerFactory.getLogger(MapFilesMap.class);
>
>   private static final PathFilter PARTS_FILTER = new PathFilter() {
>     public boolean accept(Path path) {
>       return path.getName().startsWith("part-");
>     }
>   };
>
>   private final List<MapFile.Reader> readers;
>
>   public MapFilesMap(FileSystem fs, Path parentDir, Configuration conf) throws IOException {
>     log.info("Creating MapFilesMap from parent directory {}", parentDir);
>     this.readers = new ArrayList<MapFile.Reader>();
>     try {
>       for (FileStatus status : fs.listStatus(parentDir, PARTS_FILTER)) {
>         String path = status.getPath().toString();
>         log.info("Adding MapFile.Reader at {}", path);
>         readers.add(new MapFile.Reader(fs, path, conf));
>       }
>     } catch (IOException ioe) {
>       close();
>       throw ioe;
>     }
>     if (readers.isEmpty()) {
>       throw new IllegalArgumentException("No MapFiles found in " + parentDir);
>     }
>   }
>
>   /** Returns the value for the key from the first MapFile containing it, or null. */
>   public V get(K key, V value) throws IOException {
>     for (MapFile.Reader reader : readers) {
>       if (reader.get(key, value) != null) {
>         return value;
>       }
>     }
>     log.debug("No value for key {}", key);
>     return null;
>   }
>
>   public void close() {
>     for (MapFile.Reader reader : readers) {
>       try {
>         reader.close();
>       } catch (IOException ioe) {
>         // ignore and keep closing the remaining readers
>       }
>     }
>   }
> }
>
>
>
> 2010/8/24 Sebastian Schelter <[email protected]>
>
> Ok, you guys got me convinced :)
>
> From a technical point of view two ways to implement that filter come
> to my mind:
>
> 1) Just load the user/item pairs to filter into memory in the
> AggregateAndRecommendReducer (easy but might not be scalable) like Han
> Hui suggested
> 2) Have the AggregateAndRecommendReducer not pick only the top-K
> recommendations but write all predicted preferences to disk. Add
> another M/R step after that which joins recommendations and user/item
> filter pairs to allow for custom rescoring/filtering
>
> --sebastian
>
> On 24.08.2010 06:07, Ted Dunning wrote:
> > Sorry to chime in late, but removing items after recommendation isn't
> > such a crazy thing to do.
> >
> > In particular, it is common to remove previously viewed items (for a
> > period of time). Likewise, if the user says "don't show this again",
> > it makes sense to backstop the actual recommendation system with a UI
> > limitation that does a post-recommendation elimination.
> >
> > Moreover, this approach has the great benefit that the results are
> > very predictable. Exactly the requested/seen items will be eliminated
> > and no surprising effect on recommendations will occur.
> >
> > That predictability is exactly the problem, though. Generally you
> > want a bit more systemic effect for negative recommendations. This is
> > a really sticky area, however, because negative recommendations often
> > impart information about positive preferences in addition to some
> > level of negative information.
> >
> > I used an explicit filter at both Musicmatch and at Veoh. Both
> > systems worked well. Especially at Veoh, there was a lot of
> > additional machinery required to handle the related problem of
> > anti-flooding. That was done at the UI level as well.
> >
> > On Mon, Aug 23, 2010 at 8:16 PM, Sean Owen <[email protected]> wrote:
> >
> >> (Uncanny, I was just minutes before researching Grooveshark for
> >> unrelated reasons... Good to hear from any company doing
> >> recommendations and is willing to talk about it. I know of a number
> >> that can't or won't, unfortunately.)
> >>
> >> Yeah, sounds like we're all on the same page. One key point in what
> >> I think everyone is talking about is that this is not simply
> >> removing items *after* recommendations are computed. This risks
> >> removing most or all recommended items. It needs to be done during
> >> the process of selecting recommendations.
> >>
> >> But beyond that, it's a simple idea and just a question of
> >> implementation. It's "Rescorer" in the non-Hadoop code, which does
> >> more than provide a way to remove items but rather generally
> >> rearranges recommendations according to some logic. I think it's
> >> likely easy and useful to imitate this with a simple optional
> >> Mapper/Reducer phase in this nascent "RecommenderJob" pipeline that
> >> Sebastian is now helping expand into something more configurable and
> >> general purpose.
> >>
> >> Sean
> >>
> >> On Mon, Aug 23, 2010 at 8:25 PM, Chris Bates
> >> <[email protected]> wrote:
> >>
> >>> Hi all,
> >>>
> >>> I'm new to this forum and haven't seen the code you are talking
> >>> about, so take this with a grain of salt. The way we handle "banned
> >>> items" at Grooveshark is to post-process the itemID pairs in Hive.
> >>> If a user dislikes a recommended song/artist, an item pair is
> >>> stored in HDFS, and then when the recs are computed, those banned
> >>> user-item pairs are taken into account. Here is an example query:
> >>>
> >>> SELECT DISTINCT st.uid, st.simuid, IF(b.uid=st.uid,1,0) AS banned
> >>> FROM streams_u2u st
> >>> LEFT OUTER JOIN bannedsimusers b ON (b.simuid=st.simuid);
> >>>
> >>> That query will print out a 1 or a 0 depending on whether the
> >>> recommended item pair is banned or not. Hive also supports case
> >>> statements (I think), so you can make a range of "banned-ness", I
> >>> guess. Just another solution to the "dislike" problem.
> >>>
> >>> Chris
> >>>
> >>
> >
>
>