A very good practice is to use a data set like this: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
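For reference, here is a minimal Python sketch of reading that split and checking it for overlap. It assumes the archive has been unpacked in the current directory into `20news-bydate-train/` and `20news-bydate-test/`, each holding one subdirectory per newsgroup. The helper names `load_split` and `shared_documents` are only illustrative and are not part of Mahout.

```python
import os


def load_split(root):
    """Return (newsgroup, text) pairs for every file under root/<newsgroup>/."""
    docs = []
    for group in sorted(os.listdir(root)):
        group_dir = os.path.join(root, group)
        if not os.path.isdir(group_dir):
            continue  # skip stray files next to the newsgroup directories
        for name in sorted(os.listdir(group_dir)):
            path = os.path.join(group_dir, name)
            # latin-1 maps every byte to a character, so decoding never fails
            with open(path, encoding="latin-1") as f:
                docs.append((group, f.read()))
    return docs


def shared_documents(train, test):
    """Count test documents whose exact text also appears in training."""
    train_texts = {text for _, text in train}
    return sum(1 for _, text in test if text in train_texts)


if __name__ == "__main__":
    # Assumed directory names; adjust to wherever the archive was unpacked.
    train = load_split("20news-bydate-train")
    test = load_split("20news-bydate-test")
    print(len(train), "training docs,", len(test), "test docs,",
          shared_documents(train, test), "exact duplicates across the split")
```

The duplicate check only catches byte-identical files, so treat it as a quick sanity check rather than full near-duplicate detection.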
Segregating by date avoids problems with duplicate documents appearing in both the training and test sets. It also gives you a standard split so that you can compare to other people's results.

On Thu, Sep 30, 2010 at 7:00 AM, Neil Ghosh <neil.gh...@gmail.com> wrote:

> Hi,
>
> In this example
>
> https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html
>
> The test is done on the already classified input text documents.
>
> My question is, if I want to test unknown documents, do I need them in a
> specific format? Or just keep them (as raw text) in the input folder
> while testing?
>
> Thanks and Regards
> Neil
> http://neilghosh.com