> However the sequence of steps as described in Mahout Cookbook seems to me
> incorrect as:

That is entirely possible; the book may be out of date. The end-to-end 
instructions on the website for the 20 newsgroups example are up to date, 
though, as is the example script. 

You don't want to merge all of the files into one directory. Instead, merge 
the training and testing sets in 20news-bydate while keeping their 
per-category directory structure.

> After data set download and extraction data are merged via command:
> *cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all*
> 
> Which essentially copies files to a single location -> 20news-all folder

This should not copy all of the *files* individually into the 20news-all 
folder, but rather the directories containing the files:

    $ ls 20news-all/
    alt.atheism               rec.autos           sci.space
    comp.graphics             rec.motorcycles     soc.religion.christian
    {...}
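
To make the difference concrete, here is a minimal sketch on a toy tree. The 
temporary WORK_DIR, the two categories, and the file names are made up for 
illustration, and the 20news-bydate-train / 20news-bydate-test layout under 
20news-bydate is an assumption about how the archive was extracted:

```shell
# Toy reproduction; every path and file name here is illustrative only.
WORK_DIR=$(mktemp -d)
for split in train test; do
  for category in comp.graphics sci.space; do
    mkdir -p "${WORK_DIR}/20news-bydate/20news-bydate-${split}/${category}"
    echo "sample text" \
      > "${WORK_DIR}/20news-bydate/20news-bydate-${split}/${category}/${split}-1"
  done
done

# The glob */* expands to the category directories (not to the files inside
# them), so cp -R copies whole directories and the category level survives.
mkdir "${WORK_DIR}/20news-all"
cp -R "${WORK_DIR}"/20news-bydate/*/* "${WORK_DIR}/20news-all"

# Lists the two category directories, each holding its train and test file.
ls "${WORK_DIR}/20news-all"
merged_count=$(find "${WORK_DIR}/20news-all" -type f | wc -l | tr -d ' ')
echo "${merged_count} files merged"
```

If the glob reaches one level deeper than the category directories (for 
example because the archive was unpacked at a different depth), it expands to 
the individual files and the category level is lost.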
 
> *./mahout seqdirectory  -i ${WORK_DIR}/20news-all  -o
> ${WORK_DIR}/20news-seq*
> Converts to a hadoop sequence directory from 20news-all dir - where all
> files were copied and efffectively the classification to folders were lost.
> We can peek inside a created seq file via hadoop fs -text
> $WORK_DIR/20news-seq/chunck-0 | more which prints following result:
> 
> */67399* From:xxx
> Subject: Re: Imake-TeX: looking for beta testers
> Organization: CS Department, Dortmund University, Germany
> Lines: 59
> Distribution: world
> NNTP-Posting-Host: tommy.informatik.uni-dortmund.de
> In article <xxxxx>,
> yyy writes:
> |> As I announced at the X Technical Conference in January, I would
> like
> |> to
> |> make Imake-TeX, the Imake support for using the TeX typesetting
> system,
> |> publically available. Currently Imake-TeX is in beta test here at
> the
> |> computer science department of Dortmund University, and I am
> looking
> ...
> 
> To my understanding - number after slash in bold represents a key of
> sequence file, right?

Correct, though it should read something like:

    /comp.graphics/67399 {...}

where comp.graphics is both the category and the directory that the file was 
read from.
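
A rough way to preview the keys you should expect, without running Mahout, is 
to print each file's path relative to the input directory. This only mimics 
the /category/docID shape described above; it is not what seqdirectory 
executes, and the toy directory and document IDs are made up:

```shell
# Toy stand-in for 20news-all; directory and file names are illustrative.
ALL_DIR=$(mktemp -d)
mkdir -p "${ALL_DIR}/comp.graphics" "${ALL_DIR}/sci.space"
echo "text" > "${ALL_DIR}/comp.graphics/67399"
echo "text" > "${ALL_DIR}/sci.space/60001"

# find prints ./category/docID; dropping the leading "." leaves the
# /category/docID shape that the sequence file keys should have.
keys=$(cd "${ALL_DIR}" && find . -type f | sed 's|^\.||' | sort)
echo "${keys}"
```

If this prints entries such as /67399 with no category segment, the files 
were copied flat and the labels are already gone before seqdirectory runs.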

> Then seq2sparse is performed:
> 
> ./mahout seq2sparse  -i ${WORK_DIR}/20news-seq vectors -lnorm -nv  -wt
> tfidf -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
> 
> 
> *Conclusions which I would like to verify:*
> - sequence of steps as described is incorrect - particularly conversion to
> sequence file as the key doesn't contain folder name describing the
> category of training data, or am I still missing something in here?

Yes, it looks like you are copying the individual files rather than the 
directories into 20news-all.

> 
> - mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o
> ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow
>   What are the exact mechanics when label extraction is performed e.g.
> /category/docID as a key is resolved just to category ???

yes

> Does every time
> the last part after the slash is dropped as a category?? Or is is possible
> to define the strategy somewhere?

The hard-coded convention as of Mahout 0.9 is to extract the label as the first 
string after the leading slash when the key is split on "/".  This makes 
category organization by directory and sequence file conversion with 
seqdirectory straightforward.  The new Scala DSL Naive Bayes, which is 
currently in development, will allow the user more flexibility in extracting 
the label.

The label extraction process can be found here: 
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/training/IndexInstancesMapper.java

and could be modified if need be.
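
As an illustration of that convention only (this is a shell approximation, 
not the mapper's Java code, and extract_label is a hypothetical helper name), 
splitting a key on "/" and taking the first segment after the leading slash 
looks like this:

```shell
# Hypothetical helper approximating the label rule described above.
extract_label() {
  # Field 1 is the empty string before the leading slash, so the first
  # real segment of the key is field 2.
  printf '%s\n' "$1" | cut -d/ -f2
}

label=$(extract_label "/comp.graphics/67399")
echo "${label}"        # prints comp.graphics

# A key produced from a flat directory has no category segment, so under
# this rule the document ID would be picked up as the "label" instead.
flat_label=$(extract_label "/67399")
echo "${flat_label}"   # prints 67399
```

The second call shows why keys such as /67399 cannot carry a usable category: 
there is nothing in the key for the rule to extract except the document ID.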
   
> 
> Thanks
> Jakub
> 
> On 1 December 2014 at 17:43, Andrew Palumbo <ap....@outlook.com> wrote:
> 
> > Hi Jakub,
> >
> > The step that you are missing is `$mahout seqdir ...`.   in this step each
> > file in each directory (where the directory is the Category) is converted
> > into a sequence file of form <Text,Text>  where the Text key is
> > /Category/doc_id.
> >
> > `$mahout seq2sparse ...` vectorizes the output of `$mahout seqdir ...`
> > into a sequence file of form <Text, VectorWritable> leaving the Keys
> > unchanged.
> >
> > `$mahout trainnb ... -el ...` then extracts the label from the Keys of the
> > training data ie. the "Category" from /Category/doc_id.
> >
> > please see
> > http://mahout.apache.org/users/classification/twenty-newsgroups.html
> > and http://mahout.apache.org/users/classification/bayesian.html
> > for more information.
> >
> > > Date: Mon, 1 Dec 2014 17:09:55 +0100
> > > Subject: Insights to Naive Bayes classifier example - 20news groups
> > > From: stransky...@gmail.com
> > > To: user@mahout.apache.org
> > >
> > > Hello Mahout experts,
> > >
> > > I am trying to follow some examples provided with Mahout and some
> > features
> > > are not clear to me. It would be great if someone could clarify a bit
> > more.
> > >
> > > To prepare a the data (train and test) the following sequence of steps is
> > > perfomed (taken from mahout cookbook):
> > >
> > > All input is merged into single dir:
> > > *cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all*
> > >
> > > Converted to hadoop sequence file and then vectorized:
> > > *./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o
> > ${WORK_DIR}/20news-**vectors
> > > -lnorm -nv -wt tfidf*
> > >
> > > Devided to test and train data:
> > > *./mahout split*
> > > *-i ${WORK_DIR}/20news-vectors/tfidf-vectors*
> > > *--trainingOutput ${WORK_DIR}/20news-train-vectors*
> > > *--testOutput ${WORK_DIR}/20news-test-vectors*
> > > *--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential*
> > >
> > > Model is trained:
> > > *./mahout trainnb*
> > > *-i ${WORK_DIR}/20news-train-vectors -el*
> > > *-o ${WORK_DIR}/model*
> > > *-li ${WORK_DIR}/labelindex*
> > > *-ow*
> > >
> > >
> > > What I am missing here and that is subject of my question is: Where is
> > the
> > > category assigned to the testing data to train the categorization? What I
> > > would expect is that there will be vector which says that this document
> > > belongs to a particular category. This seems to me has been ereased by
> > > first step where we mixed all the data to create our corpus. I would
> > still
> > > expect that this information will be somewhere retained. Instead the
> > > messages looks as follows:
> > >
> > > From: y...@a.cs.okstate.edu (YEO YEK CHONG)
> > > Subject: Re: Is "Kermit" available for Windows 3.0/3.1?
> > > Organization: Oklahoma State University
> > > Lines: 7
> > >
> > > From article <a4fm3b1w1...@vicuna.ocunix.on.ca>, by Steve Frampton <
> > > framp...@vicuna.ocunix.on.ca>:
> > > > I was wondering, is the "Kermit" package (the actual package, not a
> > >
> > > Yes!  In the usual ftp sites.
> > >
> > > Yek CHong
> > >
> > >
> > > There is no notion from which group this text belongs to. What's the
> > hack!
> > >
> > > Could someone please clarify a bit what's going on as when
> > crosswalidation
> > > is performed - confusion matrix takes into consideration those
> > categories.
> > >
> > > Thanks a lot for helping me out
> > > Jakub
> >
> >
> 
> 
> 
> -- 
> Jakub Stransky
> cz.linkedin.com/in/jakubstransky
                                          
