Ok that makes sense.

In general, text classification methods do not handle edge cases well.  The
best way to limit the damage is to have a very large training set and to
pick your categories very carefully.

Remember, you're just trying to produce something that is "mostly" right.  I
would just accept that the name "Ted Dunning" might get learned as a
feature: the chance that a document mentioning him is relevant is high
enough to be worth getting it wrong now and then.

If you are using something like an SVM, you can look at the support vectors
and feature weightings to see what the model is learning, and then use that
to filter more words from your training set.  For instance, it might be
worth removing names from the training set so that your model doesn't learn
them.
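
To make that concrete, here is a rough sketch of the "look at the weights,
then filter" loop.  It is not what we actually ran: it is a tiny bias-free
linear SVM trained by subgradient descent in plain Python, and the example
documents, the function names and the blocked word list are all made up for
illustration.

```python
from collections import Counter

def train_linear_svm(docs, labels, epochs=20, lr=0.1, reg=0.01):
    """Fit a bias-free linear SVM (hinge loss + L2 penalty) by subgradient descent.

    docs is a list of token lists, labels a list of +1 / -1.
    Returns a dict mapping each token to its learned weight.
    """
    weights = Counter()
    for _ in range(epochs):
        for tokens, y in zip(docs, labels):
            counts = Counter(tokens)
            score = sum(weights[t] * c for t, c in counts.items())
            # L2 penalty: shrink every existing weight a little on each step.
            for t in weights:
                weights[t] *= 1.0 - lr * reg
            # Hinge loss: only update when the margin is violated.
            if y * score < 1:
                for t, c in counts.items():
                    weights[t] += lr * y * c
    return dict(weights)

def top_features(weights, n=5):
    """Return the n (token, weight) pairs with the largest weights."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Hypothetical toy data: +1 = relevant to the topic, -1 = not relevant.
examples = [
    ("ted dunning explains clustering with mahout", +1),
    ("ted dunning on training a classifier", +1),
    ("a paper about clustering techniques", +1),
    ("a recipe for tasty spaghetti", -1),
    ("review of a new racing game", -1),
    ("tomato sauce recipe for pasta", -1),
]
docs = [text.split() for text, _ in examples]
labels = [label for _, label in examples]

weights = train_linear_svm(docs, labels)
print(top_features(weights))  # inspect what the model leans on

# If a name shows up with a large weight, drop it and retrain.
blocked = {"ted", "dunning"}  # assumed list, built by hand from the inspection
filtered_docs = [[t for t in doc if t not in blocked] for doc in docs]
filtered_weights = train_linear_svm(filtered_docs, labels)
```

With a real library the idea is the same: for a linear model you read the
per-feature coefficients, sort them, and decide by eye which tokens should
not be carrying that much weight.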

I know the first time we played with the 20 Newsgroups dataset, the model
was heavily weighting the email addresses of the people posting, which
means it wouldn't generalize well.  So we filtered out email addresses.
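
A minimal sketch of that kind of filter, assuming a simple regular
expression is good enough for your data.  The pattern and the strip_emails
helper are illustrative, not the code we used, and the pattern does not try
to cover every valid address format.

```python
import re

# Deliberately simple pattern: local part, "@", domain, dot, top-level domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def strip_emails(text):
    """Remove email addresses and collapse the leftover whitespace.

    Note that this also collapses newlines into single spaces.
    """
    return " ".join(EMAIL_RE.sub(" ", text).split())

print(strip_emails("From: alice@example.com wrote about SVMs"))
```

Run this over every document before tokenizing, so the addresses never
become features in the first place.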

Not sure if this is helpful or not.  Just my 2 cents.

Zach


On Mon, Sep 26, 2011 at 11:08 AM, Em <mailformailingli...@yahoo.de> wrote:

> Zach,
>
> thanks for your feedback!
>
> I want to categorize them into a general-purpose category (nothing
> individual).
> The goal is to get an overview about every document that has to do with
> the domain in some way and to throw away everything else.
>
> Regards,
> Em
>
> Am 26.09.2011 17:11, schrieb Zach Richardson:
> > Em,
> >
> > This really all depends on your goal.  Do you want them to be scored as
> > interesting to an individual or do you want them categorized into topics?
> >
> > How you set those problems up can be very different based on the end
> > goal.  What is yours?
> >
> > Thanks,
> >
> > Zach
> >
> >
> > On Mon, Sep 26, 2011 at 9:55 AM, Em <mailformailingli...@yahoo.de> wrote:
> >
> >> No experiences?
> >>
> >> Regards,
> >> Em
> >>
> >> Am 23.09.2011 12:48, schrieb Em:
> >>> Hello list,
> >>>
> >>> let's say I want to classify documents and there are two possible
> >>> outcomes: Yes, the document belongs to the topic I focus on, or No, it
> >>> doesn't.
> >>>
> >>> The topic is for example: Machine Learning.
> >>>
> >>> Doc1: A sub-chapter of the book "Mahout in Action"
> >>> Doc2: A paper about clustering-techniques
> >>> Doc3: A Blog-Post of Ted Dunning, Machine-Learning-Expert, talking about
> >>> his opinion regarding the relationship between Google and Oracle
> >>> Doc4: Ted Dunning is talking about how to cook tasty spaghetti (Sorry
> >>> Ted, you are my guinea pig in this case)
> >>>
> >>> The point is: Doc3 is not really about Machine Learning, however it
> >>> might be relevant for people that are interested in Machine Learning,
> >>> since the author is a Machine-Learning-Expert and his opinion might
> >>> reflect some thoughts regarding that domain.
> >>>
> >>> Doc4 is completely irrelevant. It has to do with Ted Dunning, but not
> >>> with Machine Learning nor software at all. The only exception would be
> >>> if Ted wrote a piece of Machine Learning software that is creating a
> >>> recipe for cooking tasty spaghetti ;).
> >>>
> >>> If I change the topic to something like "Star Trek":
> >>>
> >>> Doc1: A review of a Star Trek movie
> >>> Doc2: A Star Trek computer game's description
> >>> Doc3: A review regarding a PlayStation 3 Star Trek game
> >>> Doc4: The announcement that the gaming studio of the Star Trek games is
> >>> going to create a new Star Wars game
> >>> Doc5: A Star Wars book's description
> >>> Doc6: The gaming studio of the Star Trek games is going to create a
> >>> Need for Speed clone
> >>>
> >>> Doc 1,2 and 3 are relevant for Trekkies. Doc 4 might be as well, because
> >>> the studio is an authority for creating good Star Trek games and they
> >>> noted that their experiences with Star Trek will help them building a
> >>> good Star Wars game. Some fans might be interested in this.
> >>>
> >>> However doc 5 is completely irrelevant, since it has nothing to do with
> >>> Star Trek.
> >>> Doc 6 is about an authority in the Star Trek merchandise-industry but it
> >>> correlates with my Ted-cooks-spaghetti example from my first example -
> >>> Doc 6 is irrelevant.
> >>>
> >>> Doc3 of my "Machine Learning" example and Doc 4 of my "Star Trek" one
> >>> are boundary values for being relevant. They might interest people that
> >>> focus on the two named domains, but they sail very close to the wind.
> >>>
> >>> Does it generally make sense to take such examples into account for
> >>> training a model? Real humans may have a discussion about those examples
> >>> whether they really belong to the domain they want to focus on.
> >>>
> >>> Thank you for your advice.
> >>>
> >>> Regards,
> >>> Em
> >>
> >
> >
> >
>



-- 
Zach Richardson
Ravel, Co-founder
Austin, TX
z...@raveldata.com
512.825.6031
