A very good practice is to use a data set like this:
http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Segregating by date avoids problems with duplicate documents appearing in
both the training and test sets.  It also gives you a standard split so that
you can compare against other people's results.
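
If you ever need to build a date-based split for your own corpus, the idea
can be sketched roughly as below. This is only an illustration, not how the
20news-bydate archive was produced: the function name split_by_date, the
(posted_on, text) tuple layout, the cutoff date, and the sample documents are
all assumptions made up for the example.

```python
from datetime import date


def split_by_date(docs, cutoff):
    """Split (posted_on, text) pairs into train and test lists.

    Documents posted before `cutoff` go to train; documents posted on or
    after `cutoff` go to test. The tuple layout is a hypothetical choice.
    """
    train = [text for posted_on, text in docs if posted_on < cutoff]
    test = [text for posted_on, text in docs if posted_on >= cutoff]
    return train, test


# Made-up sample documents, for illustration only.
docs = [
    (date(1993, 3, 1), "first post"),
    (date(1993, 3, 2), "Re: first post"),
    (date(1993, 4, 20), "later post"),
]

train, test = split_by_date(docs, date(1993, 4, 1))
```

Every document lands in exactly one of the two lists, and the split is
reproducible because it depends only on the posting date and the cutoff.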

On Thu, Sep 30, 2010 at 7:00 AM, Neil Ghosh <neil.gh...@gmail.com> wrote:

> Hi,
>
> In this example
>
> https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html
>
> The test is done on the already classified input text documents.
>
> My question is: if I want to test unknown documents, do I need them in a
> specific format, or can I just keep them (as raw text) in the input folder
> while testing?
>
> Thanks and Regards
> Neil
> http://neilghosh.com
>
