Author: buildbot
Date: Fri May 2 18:00:37 2014
New Revision: 907792

Log:
Staging update by buildbot for mahout

Modified: websites/staging/mahout/trunk/content/ (props changed) websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html Propchange: websites/staging/mahout/trunk/content/ ------------------------------------------------------------------------------ --- cms:source-revision (original) +++ cms:source-revision Fri May 2 18:00:37 2014 @@ -1 +1 @@ -1591731 +1591989 Modified: websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html ============================================================================== --- websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html (original) +++ websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html Fri May 2 18:00:37 2014 @@ -236,7 +236,6 @@ <div id="content-wrap" class="clearfix"> <div id="main"> <h1 id="creating-vectors-from-text">Creating vectors from text</h1> -<p>available starting <em>Mahout_0.2</em></p> <p><a name="CreatingVectorsfromText-Introduction"></a></p> <h1 id="introduction">Introduction</h1> <p>For clustering and classifying documents it is usually necessary to convert the raw text @@ -254,10 +253,10 @@ representations from a Lucene (and Solr, <p>For this, we assume you know how to build a Lucene/Solr index. For those who don't, it is probably easiest to get up and running using <a href="http://lucene.apache.org/solr">Solr</a> as it can ingest things like PDFs, XML, Office, etc. and create a Lucene -index. For those wanting to use just Lucene, see the <a href="http://lucene.apache.org/java">Lucene website</a> +index. 
For those wanting to use just Lucene, see the <a href="http://lucene.apache.org/core">Lucene website</a> or check out <em>Lucene In Action</em> by Erik Hatcher, Otis Gospodnetic and Mike McCandless.</p> -<p>To get started, make sure you get a fresh copy of Mahout from <a href="../developers/buildingmahout.html">SVN</a> +<p>To get started, make sure you get a fresh copy of Mahout from <a href="http://mahout.apache.org/developers/buildingmahout.html">SVN</a> and are comfortable building it. It defines interfaces and implementations for efficiently iterating over a Data Source (it only supports Lucene currently, but should be extensible to databases, Solr, etc.) and produces @@ -267,28 +266,69 @@ in the org.apache.mahout.utils.vectors p several input options, which can be displayed by specifying the --help option. Examples of running the Driver are included below:</p> <p><a name="CreatingVectorsfromText-GeneratinganoutputfilefromaLuceneIndex"></a></p> -<h2 id="generating-an-output-file-from-a-lucene-index">Generating an output file from a Lucene Index</h2> -<div class="codehilite"><pre><span class="p">$</span>MAHOUT_HOME<span class="o">/</span>bin<span class="o">/</span>mahout lucene.vector <span class="o"><</span>PATH TO DIRECTORY CONTAINING LUCENE INDEX<span class="o">></span> - - <span class="o">--</span>output <span class="o"><</span>PATH TO OUTPUT LOCATION<span class="o">></span> - - <span class="o">--</span>field <span class="o"><</span>NAME OF FIELD IN INDEX<span class="o">></span> - - <span class="o">--</span>dictOut <span class="o"><</span>PATH TO FILE TO OUTPUT THE DICTIONARY TO<span class="o">></span> - - <span class="o"><--</span>max <span class="o"><</span>Number of vectors to output<span class="o">>></span> <span class="o"><--</span>norm <span class="p">{</span>INF<span class="o">|</span>integer <span class="o">>=</span> <span class="m">0</span><span class="p">}</span><span class="o">></span> - - <span class="o"><--</span>idField <span 
class="o"><</span>Name of the idField in the Lucene index<span class="o">>></span> +<h4 id="generating-an-output-file-from-a-lucene-index">Generating an output file from a Lucene Index</h4> +<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span> + <span class="o">--</span><span class="n">dir</span> <span class="p">(</span><span class="o">-</span><span class="n">d</span><span class="p">)</span> <span class="n">dir</span> <span class="n">The</span> <span class="n">Lucene</span> <span class="n">directory</span> + <span class="o">--</span><span class="n">idField</span> <span class="n">idField</span> <span class="n">The</span> <span class="n">field</span> <span class="n">in</span> <span class="n">the</span> <span class="n">index</span> + <span class="n">containing</span> <span class="n">the</span> <span class="n">index</span><span class="p">.</span> <span class="n">If</span> + <span class="n">null</span><span class="p">,</span> <span class="n">then</span> <span class="n">the</span> <span class="n">Lucene</span> + <span class="n">internal</span> <span class="n">doc</span> <span class="n">id</span> <span class="n">is</span> <span class="n">used</span> + <span class="n">which</span> <span class="n">is</span> <span class="n">prone</span> <span class="n">to</span> <span class="n">error</span> + <span class="k">if</span> <span class="n">the</span> <span class="n">underlying</span> <span class="n">index</span> + <span class="n">changes</span> + <span class="o">--</span><span class="n">output</span> <span class="p">(</span><span class="o">-</span><span class="n">o</span><span class="p">)</span> <span class="n">output</span> <span class="n">The</span> <span class="n">output</span> <span class="n">file</span> + <span class="o">--</span><span class="n">delimiter</span> <span 
class="p">(</span><span class="o">-</span><span class="n">l</span><span class="p">)</span> <span class="n">delimiter</span> <span class="n">The</span> <span class="n">delimiter</span> <span class="k">for</span> + <span class="n">outputting</span> <span class="n">the</span> <span class="n">dictionary</span> + <span class="o">--</span><span class="n">help</span> <span class="p">(</span><span class="o">-</span><span class="n">h</span><span class="p">)</span> <span class="n">Print</span> <span class="n">out</span> <span class="n">help</span> + <span class="o">--</span><span class="n">field</span> <span class="p">(</span><span class="o">-</span><span class="n">f</span><span class="p">)</span> <span class="n">field</span> <span class="n">The</span> <span class="n">field</span> <span class="n">in</span> <span class="n">the</span> <span class="n">index</span> + <span class="o">--</span><span class="n">max</span> <span class="p">(</span><span class="o">-</span><span class="n">m</span><span class="p">)</span> <span class="n">max</span> <span class="n">The</span> <span class="n">maximum</span> <span class="n">number</span> <span class="n">of</span> + <span class="n">vectors</span> <span class="n">to</span> <span class="n">output</span><span class="p">.</span> <span class="n">If</span> + <span class="n">not</span> <span class="n">specified</span><span class="p">,</span> <span class="n">then</span> <span class="n">it</span> + <span class="n">will</span> <span class="n">loop</span> <span class="n">over</span> <span class="n">all</span> <span class="n">docs</span> + <span class="o">--</span><span class="n">dictOut</span> <span class="p">(</span><span class="o">-</span><span class="n">t</span><span class="p">)</span> <span class="n">dictOut</span> <span class="n">The</span> <span class="n">output</span> <span class="n">of</span> <span class="n">the</span> + <span class="n">dictionary</span> + <span class="o">--</span><span class="n">seqDictOut</span> <span class="p">(</span><span 
class="o">-</span><span class="n">st</span><span class="p">)</span> <span class="n">seqDictOut</span> <span class="n">The</span> <span class="n">output</span> <span class="n">of</span> <span class="n">the</span> + <span class="n">dictionary</span> <span class="n">as</span> <span class="n">sequence</span> + <span class="n">file</span> + <span class="o">--</span><span class="n">norm</span> <span class="p">(</span><span class="o">-</span><span class="n">n</span><span class="p">)</span> <span class="n">norm</span> <span class="n">The</span> <span class="n">norm</span> <span class="n">to</span> <span class="n">use</span><span class="p">,</span> + <span class="n">expressed</span> <span class="n">as</span> <span class="n">either</span> <span class="n">a</span> + <span class="n">double</span> <span class="n">or</span> "<span class="n">INF</span>" <span class="k">if</span> <span class="n">you</span> + <span class="n">want</span> <span class="n">to</span> <span class="n">use</span> <span class="n">the</span> <span class="n">Infinite</span> + <span class="n">norm</span><span class="p">.</span> <span class="n">Must</span> <span class="n">be</span> <span class="n">greater</span> <span class="n">or</span> + <span class="n">equal</span> <span class="n">to</span> 0<span class="p">.</span> <span class="n">The</span> <span class="n">default</span> + <span class="n">is</span> <span class="n">not</span> <span class="n">to</span> <span class="n">normalize</span> + <span class="o">--</span><span class="n">maxDFPercent</span> <span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">)</span> <span class="n">maxDFPercent</span> <span class="n">The</span> <span class="n">max</span> <span class="n">percentage</span> <span class="n">of</span> + <span class="n">docs</span> <span class="k">for</span> <span class="n">the</span> <span class="n">DF</span><span class="p">.</span> <span class="n">Can</span> <span class="n">be</span> + <span class="n">used</span> <span 
class="n">to</span> <span class="n">remove</span> <span class="n">really</span> + <span class="n">high</span> <span class="n">frequency</span> <span class="n">terms</span><span class="p">.</span> + <span class="n">Expressed</span> <span class="n">as</span> <span class="n">an</span> <span class="n">integer</span> + <span class="n">between</span> 0 <span class="n">and</span> 100<span class="p">.</span> + <span class="n">Default</span> <span class="n">is</span> 99<span class="p">.</span> + <span class="o">--</span><span class="n">weight</span> <span class="p">(</span><span class="o">-</span><span class="n">w</span><span class="p">)</span> <span class="n">weight</span> <span class="n">The</span> <span class="n">kind</span> <span class="n">of</span> <span class="n">weight</span> <span class="n">to</span> + <span class="n">use</span><span class="p">.</span> <span class="n">Currently</span> <span class="n">TF</span> <span class="n">or</span> + <span class="n">TFIDF</span> + <span class="o">--</span><span class="n">minDF</span> <span class="p">(</span><span class="o">-</span><span class="n">md</span><span class="p">)</span> <span class="n">minDF</span> <span class="n">The</span> <span class="n">minimum</span> <span class="n">document</span> + <span class="n">frequency</span><span class="p">.</span> <span class="n">Default</span> <span class="n">is</span> 1 + <span class="o">--</span><span class="n">maxPercentErrorDocs</span> <span class="p">(</span><span class="o">-</span><span class="n">err</span><span class="p">)</span> <span class="n">mErr</span> <span class="n">The</span> <span class="n">max</span> <span class="n">percentage</span> <span class="n">of</span> + <span class="n">docs</span> <span class="n">that</span> <span class="n">can</span> <span class="n">have</span> <span class="n">a</span> <span class="n">null</span> + <span class="n">term</span> <span class="n">vector</span><span class="p">.</span> <span class="n">These</span> <span class="n">are</span> + <span 
class="n">noise</span> <span class="n">document</span> <span class="n">and</span> <span class="n">can</span> + <span class="n">occur</span> <span class="k">if</span> <span class="n">the</span> <span class="n">analyzer</span> + <span class="n">used</span> <span class="n">strips</span> <span class="n">out</span> <span class="n">all</span> <span class="n">terms</span> + <span class="n">in</span> <span class="n">the</span> <span class="n">target</span> <span class="n">field</span><span class="p">.</span> <span class="n">This</span> + <span class="n">percentage</span> <span class="n">is</span> <span class="n">expressed</span> + <span class="n">as</span> <span class="n">a</span> <span class="n">value</span> <span class="n">between</span> 0 <span class="n">and</span> + 1<span class="p">.</span> <span class="n">The</span> <span class="n">default</span> <span class="n">is</span> 0<span class="p">.</span> </pre></div> -<p><a name="CreatingVectorsfromText-Create50VectorsfromanIndex"></a></p> -<h3 id="create-50-vectors-from-an-index">Create 50 Vectors from an Index</h3> -<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span> <span class="o">--</span><span class="n">dir</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">solr</span><span class="o">/</span><span class="n">data</span><span class="o">/</span><span class="n">index</span> <span class="o">--</span><span class="n">field</span> <span class="n">body</span> - +<h4 id="create-50-vectors-from-an-index">Create 50 Vectors from an Index</h4> +<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span 
class="n">lucene</span><span class="p">.</span><span class="n">vector</span> + <span class="o">--</span><span class="n">dir</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">solr</span><span class="o">/</span><span class="n">data</span><span class="o">/</span><span class="n">index</span> + <span class="o">--</span><span class="n">field</span> <span class="n">body</span> <span class="o">--</span><span class="n">dictOut</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">dict</span><span class="p">.</span><span class="n">txt</span> - - <span class="o">--</span><span class="n">output</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">out</span><span class="p">.</span><span class="n">txt</span> <span class="o">--</span><span class="n">max</span> 50 + <span class="o">--</span><span class="n">output</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">out</span><span class="p">.</span><span class="n">txt</span> + <span class="o">--</span><span class="n">max</span> 50 </pre></div> @@ -296,83 +336,126 @@ option. Examples of running the Driver out the info to the output dir and the dictionary to dict.txt. It only outputs 50 vectors. 
If you don't specify --max, then all the documents in the index are output.</p> -<p><a name="CreatingVectorsfromText-Normalize50VectorsfromaLuceneIndexusingthe<a href="http://en.wikipedia.org/wiki/Lp_space">L_2Norm</a>"></a></p> -<h3 id="normalize-50-vectors-from-a-lucene-index-using-the-l_2-normhttpenwikipediaorgwikilp_space">Normalize 50 Vectors from a Lucene Index using the [L_2 Norm|http://en.wikipedia.org/wiki/Lp_space]</h3> -<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span> <span class="o">--</span><span class="n">dir</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">solr</span><span class="o">/</span><span class="n">data</span><span class="o">/</span><span class="n">index</span> <span class="o">--</span><span class="n">field</span> <span class="n">body</span> - - <span class="o">--</span><span class="n">dictOut</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">dict</span><span class="p">.</span><span class="n">txt</span> - - <span class="o">--</span><span class="n">output</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">out</span><span class="p">.</span><span class="n">txt</span> <span class="o">--</span><span class="n">max</span> 50 <span class="o">--</span><span class="n">norm</span> 2 +<p><a name="CreatingVectorsfromText-50VectorsFromLuceneL2Norm"></a></p> +<h4 id="creating-50-normalized-vectors-from-a-lucene-index-using-the-l_2-norm">Creating 50 
Normalized Vectors from a Lucene Index using the <a href="http://en.wikipedia.org/wiki/Lp_space">L_2 Norm</a></h4> +<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span> + <span class="o">--</span><span class="n">dir</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">solr</span><span class="o">/</span><span class="n">data</span><span class="o">/</span><span class="n">index</span> + <span class="o">--</span><span class="n">field</span> <span class="n">body</span> + <span class="o">--</span><span class="n">dictOut</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">dict</span><span class="p">.</span><span class="n">txt</span> + <span class="o">--</span><span class="n">output</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">out</span><span class="p">.</span><span class="n">txt</span> + <span class="o">--</span><span class="n">max</span> 50 + <span class="o">--</span><span class="n">norm</span> 2 </pre></div> <p><a name="CreatingVectorsfromText-FromDirectoryofTextdocuments"></a></p> -<h1 id="from-directory-of-text-documents">From Directory of Text documents</h1> +<h2 id="from-a-directory-of-text-documents">From A Directory of Text documents</h2> <p>Mahout has utilities to generate Vectors from a directory of text documents. Before creating the vectors, you need to convert the documents to SequenceFile format. 
SequenceFile is a hadoop class which allows us to write arbitary key,value pairs into it. The DocumentVectorizer requires the key to be a Text with a unique document id, and value to be the Text content in UTF-8 format.</p> -<p>You may find Tika (http://lucene.apache.org/tika) helpful in converting +<p>You may find <a href="http://tika.apache.org/">Tika</a> helpful in converting binary documents to text.</p> <p><a name="CreatingVectorsfromText-ConvertingdirectoryofdocumentstoSequenceFileformat"></a></p> -<h2 id="converting-directory-of-documents-to-sequencefile-format">Converting directory of documents to SequenceFile format</h2> +<h4 id="converting-directory-of-documents-to-sequencefile-format">Converting directory of documents to SequenceFile format</h4> <p>Mahout has a nifty utility which reads a directory path including its sub-directories and creates the SequenceFile in a chunked manner for us. the document id generated is <PREFIX><RELATIVE PATH FROM PARENT>/document.txt</p> -<p>From the examples directory run</p> -<div class="codehilite"><pre><span class="p">$</span>MAHOUT_HOME<span class="o">/</span>bin<span class="o">/</span>mahout seqdirectory - -<span class="o">--</span>input <span class="o"><</span>PARENT DIR WHERE DOCS ARE LOCATED<span class="o">></span> <span class="o">--</span>output <span class="o"><</span>OUTPUT DIRECTORY<span class="o">></span> - -<span class="o"><-</span>c <span class="o"><</span>CHARSET NAME OF THE INPUT DOCUMENTS<span class="o">></span> <span class="p">{</span>UTF<span class="o">-</span><span class="m">8</span><span class="o">|</span>cp1252<span class="o">|</span>ascii...<span class="p">}</span><span class="o">></span> - -<span class="o"><-</span>chunk <span class="o"><</span>MAX SIZE OF EACH CHUNK in Megabytes<span class="o">></span> <span class="m">64</span><span class="o">></span> - -<span class="o"><-</span>prefix <span class="o"><</span>PREFIX TO ADD TO THE DOCUMENT ID<span class="o">>></span> +<div 
class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">seqdirectory</span> + <span class="o">--</span><span class="n">input</span> <span class="p">(</span><span class="o">-</span><span class="nb">i</span><span class="p">)</span> <span class="n">input</span> <span class="n">Path</span> <span class="n">to</span> <span class="n">job</span> <span class="n">input</span> <span class="n">directory</span><span class="p">.</span> + <span class="o">--</span><span class="n">output</span> <span class="p">(</span><span class="o">-</span><span class="n">o</span><span class="p">)</span> <span class="n">output</span> <span class="n">The</span> <span class="n">directory</span> <span class="n">pathname</span> <span class="k">for</span> + <span class="n">output</span><span class="p">.</span> + <span class="o">--</span><span class="n">overwrite</span> <span class="p">(</span><span class="o">-</span><span class="n">ow</span><span class="p">)</span> <span class="n">If</span> <span class="n">present</span><span class="p">,</span> <span class="n">overwrite</span> <span class="n">the</span> + <span class="n">output</span> <span class="n">directory</span> <span class="n">before</span> + <span class="n">running</span> <span class="n">job</span> + <span class="o">--</span><span class="n">method</span> <span class="p">(</span><span class="o">-</span><span class="n">xm</span><span class="p">)</span> <span class="n">method</span> <span class="n">The</span> <span class="n">execution</span> <span class="n">method</span> <span class="n">to</span> <span class="n">use</span><span class="p">:</span> + <span class="n">sequential</span> <span class="n">or</span> <span class="n">mapreduce</span><span class="p">.</span> + <span class="n">Default</span> <span class="n">is</span> <span class="n">mapreduce</span> + <span class="o">--</span><span class="n">chunkSize</span> <span 
class="p">(</span><span class="o">-</span><span class="n">chunk</span><span class="p">)</span> <span class="n">chunkSize</span> <span class="n">The</span> <span class="n">chunkSize</span> <span class="n">in</span> <span class="n">MegaBytes</span><span class="p">.</span> + <span class="n">Defaults</span> <span class="n">to</span> 64 + <span class="o">--</span><span class="n">fileFilterClass</span> <span class="p">(</span><span class="o">-</span><span class="n">filter</span><span class="p">)</span> <span class="n">fFilterClass</span> <span class="n">The</span> <span class="n">name</span> <span class="n">of</span> <span class="n">the</span> <span class="n">class</span> <span class="n">to</span> <span class="n">use</span> + <span class="k">for</span> <span class="n">file</span> <span class="n">parsing</span><span class="p">.</span> <span class="n">Default</span><span class="p">:</span> + <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">mahout</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">PrefixAdditionFilter</span> + <span class="o">--</span><span class="n">keyPrefix</span> <span class="p">(</span><span class="o">-</span><span class="n">prefix</span><span class="p">)</span> <span class="n">keyPrefix</span> <span class="n">The</span> <span class="n">prefix</span> <span class="n">to</span> <span class="n">be</span> <span class="n">prepended</span> <span class="n">to</span> + <span class="n">the</span> <span class="n">key</span> + <span class="o">--</span><span class="n">charset</span> <span class="p">(</span><span class="o">-</span><span class="n">c</span><span class="p">)</span> <span class="n">charset</span> <span class="n">The</span> <span class="n">name</span> <span class="n">of</span> <span class="n">the</span> <span class="n">character</span> + <span class="n">encoding</span> <span class="n">of</span> <span class="n">the</span> <span 
class="n">input</span> <span class="n">files</span><span class="p">.</span> + <span class="n">Default</span> <span class="n">to</span> <span class="n">UTF</span><span class="o">-</span>8 <span class="p">{</span><span class="n">accepts</span><span class="p">:</span> <span class="n">cp1252</span><span class="o">|</span><span class="n">ascii</span><span class="p">...}</span> + <span class="o">--</span><span class="n">method</span> <span class="p">(</span><span class="o">-</span><span class="n">xm</span><span class="p">)</span> <span class="n">method</span> <span class="n">The</span> <span class="n">execution</span> <span class="n">method</span> <span class="n">to</span> <span class="n">use</span><span class="p">:</span> + <span class="n">sequential</span> <span class="n">or</span> <span class="n">mapreduce</span><span class="p">.</span> + <span class="n">Default</span> <span class="n">is</span> <span class="n">mapreduce</span> + <span class="o">--</span><span class="n">overwrite</span> <span class="p">(</span><span class="o">-</span><span class="n">ow</span><span class="p">)</span> <span class="n">If</span> <span class="n">present</span><span class="p">,</span> <span class="n">overwrite</span> <span class="n">the</span> + <span class="n">output</span> <span class="n">directory</span> <span class="n">before</span> + <span class="n">running</span> <span class="n">job</span> + <span class="o">--</span><span class="n">help</span> <span class="p">(</span><span class="o">-</span><span class="n">h</span><span class="p">)</span> <span class="n">Print</span> <span class="n">out</span> <span class="n">help</span> + <span class="o">--</span><span class="n">tempDir</span> <span class="n">tempDir</span> <span class="n">Intermediate</span> <span class="n">output</span> <span class="n">directory</span> + <span class="o">--</span><span class="n">startPhase</span> <span class="n">startPhase</span> <span class="n">First</span> <span class="n">phase</span> <span class="n">to</span> 
<span class="n">run</span> + <span class="o">--</span><span class="n">endPhase</span> <span class="n">endPhase</span> <span class="n">Last</span> <span class="n">phase</span> <span class="n">to</span> <span class="n">run</span> <span class="o">></span> </pre></div> <p><a name="CreatingVectorsfromText-CreatingVectorsfromSequenceFile"></a></p> -<h2 id="creating-vectors-from-sequencefile">Creating Vectors from SequenceFile</h2> -<p>+<em>Mahout_0.3</em>+</p> +<h4 id="creating-vectors-from-sequencefile">Creating Vectors from SequenceFile</h4> <p>From the sequence file generated from the above step run the following to generate vectors. </p> -<div class="codehilite"><pre><span class="p">$</span>MAHOUT_HOME<span class="o">/</span>bin<span class="o">/</span>mahout seq2sparse - -<span class="o">-</span>i <span class="o"><</span>PATH TO THE SEQUENCEFILES<span class="o">></span> - -<span class="o">-</span>o <span class="o"><</span>OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY IS GENERATED<span class="o">></span> - -<span class="o"><-</span>wt <span class="o"><</span>WEIGHTING METHOD USED<span class="o">></span> <span class="p">{</span>tf<span class="o">|</span>tfidf<span class="p">}</span><span class="o">></span> - -<span class="o"><-</span>chunk <span class="o"><</span>MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY<span class="o">></span> <span class="m">100</span><span class="o">></span> - -<span class="o"><-</span>a <span class="o"><</span>NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT<span class="o">></span> - -org.apache.lucene.analysis.standard.StandardAnalyzer<span class="o">></span> - -<span class="o"><--</span>minSupport <span class="o"><</span>MINIMUM SUPPORT<span class="o">></span> <span class="m">2</span><span class="o">></span> - -<span class="o"><--</span>minDF <span class="o"><</span>MINIMUM DOCUMENT FREQUENCY<span class="o">></span> <span class="m">1</span><span class="o">></span> - -<span class="o"><--</span>maxDFPercent <span 
class="o"><</span>MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN <span class="m">0</span><span class="o">-</span><span class="m">100</span><span class="o">></span> <span class="m">99</span><span class="o">></span> - -<span class="o"><--</span>norm <span class="o"><</span>REFER TO L_2 NORM ABOVE<span class="o">></span><span class="p">{</span>INF<span class="o">|</span>integer <span class="o">>=</span> <span class="m">0</span><span class="p">}</span><span class="o">></span><span class="s">"</span> - -<span class="s"><-seq <Create SequentialAccessVectors>{false|true required for running some algorithms(LDA,Lanczos)}>"</span> +<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">seq2sparse</span> + <span class="o">--</span><span class="n">minSupport</span> <span class="p">(</span><span class="o">-</span><span class="n">s</span><span class="p">)</span> <span class="n">minSupport</span> <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">Minimum</span> <span class="n">Support</span><span class="p">.</span> <span class="n">Default</span> + <span class="n">Value</span><span class="p">:</span> 2 + <span class="o">--</span><span class="n">analyzerName</span> <span class="p">(</span><span class="o">-</span><span class="n">a</span><span class="p">)</span> <span class="n">analyzerName</span> <span class="n">The</span> <span class="n">class</span> <span class="n">name</span> <span class="n">of</span> <span class="n">the</span> <span class="n">analyzer</span> + <span class="o">--</span><span class="n">chunkSize</span> <span class="p">(</span><span class="o">-</span><span class="n">chunk</span><span class="p">)</span> <span class="n">chunkSize</span> <span class="n">The</span> <span class="n">chunkSize</span> <span class="n">in</span> <span class="n">MegaBytes</span><span class="p">.</span> <span 
class="n">Default</span> + <span class="n">Value</span><span class="p">:</span> 100<span class="n">MB</span> + <span class="o">--</span><span class="n">output</span> <span class="p">(</span><span class="o">-</span><span class="n">o</span><span class="p">)</span> <span class="n">output</span> <span class="n">The</span> <span class="n">directory</span> <span class="n">pathname</span> <span class="k">for</span> <span class="n">output</span><span class="p">.</span> + <span class="o">--</span><span class="n">input</span> <span class="p">(</span><span class="o">-</span><span class="nb">i</span><span class="p">)</span> <span class="n">input</span> <span class="n">Path</span> <span class="n">to</span> <span class="n">job</span> <span class="n">input</span> <span class="n">directory</span><span class="p">.</span> + <span class="o">--</span><span class="n">minDF</span> <span class="p">(</span><span class="o">-</span><span class="n">md</span><span class="p">)</span> <span class="n">minDF</span> <span class="n">The</span> <span class="n">minimum</span> <span class="n">document</span> <span class="n">frequency</span><span class="p">.</span> <span class="n">Default</span> + <span class="n">is</span> 1 + <span class="o">--</span><span class="n">maxDFSigma</span> <span class="p">(</span><span class="o">-</span><span class="n">xs</span><span class="p">)</span> <span class="n">maxDFSigma</span> <span class="n">What</span> <span class="n">portion</span> <span class="n">of</span> <span class="n">the</span> <span class="n">tf</span> <span class="p">(</span><span class="n">tf</span><span class="o">-</span><span class="n">idf</span><span class="p">)</span> <span class="n">vectors</span> + <span class="n">to</span> <span class="n">be</span> <span class="n">used</span><span class="p">,</span> <span class="n">expressed</span> <span class="n">in</span> <span class="n">times</span> <span class="n">the</span> + <span class="n">standard</span> <span class="n">deviation</span> <span 
class="p">(</span><span class="n">sigma</span><span class="p">)</span> <span class="n">of</span> <span class="n">the</span> + <span class="n">document</span> <span class="n">frequencies</span> <span class="n">of</span> <span class="n">these</span> <span class="n">vectors</span><span class="p">.</span> + <span class="n">Can</span> <span class="n">be</span> <span class="n">used</span> <span class="n">to</span> <span class="n">remove</span> <span class="n">really</span> <span class="n">high</span> + <span class="n">frequency</span> <span class="n">terms</span><span class="p">.</span> <span class="n">Expressed</span> <span class="n">as</span> <span class="n">a</span> <span class="n">double</span> + <span class="n">value</span><span class="p">.</span> <span class="n">Good</span> <span class="n">value</span> <span class="n">to</span> <span class="n">be</span> <span class="n">specified</span> <span class="n">is</span> 3<span class="p">.</span>0<span class="p">.</span> + <span class="n">In</span> <span class="k">case</span> <span class="n">the</span> <span class="n">value</span> <span class="n">is</span> <span class="n">less</span> <span class="n">than</span> 0 <span class="n">no</span> + <span class="n">vectors</span> <span class="n">will</span> <span class="n">be</span> <span class="n">filtered</span> <span class="n">out</span><span class="p">.</span> <span class="n">Default</span> <span class="n">is</span> + <span class="o">-</span>1<span class="p">.</span>0<span class="p">.</span> <span class="n">Overrides</span> <span class="n">maxDFPercent</span> + <span class="o">--</span><span class="n">maxDFPercent</span> <span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">)</span> <span class="n">maxDFPercent</span> <span class="n">The</span> <span class="n">max</span> <span class="n">percentage</span> <span class="n">of</span> <span class="n">docs</span> <span class="k">for</span> <span class="n">the</span> <span class="n">DF</span><span 
class="p">.</span> + <span class="n">Can</span> <span class="n">be</span> <span class="n">used</span> <span class="n">to</span> <span class="n">remove</span> <span class="n">really</span> <span class="n">high</span> + <span class="n">frequency</span> <span class="n">terms</span><span class="p">.</span> <span class="n">Expressed</span> <span class="n">as</span> <span class="n">an</span> <span class="n">integer</span> + <span class="n">between</span> 0 <span class="n">and</span> 100<span class="p">.</span> <span class="n">Default</span> <span class="n">is</span> 99<span class="p">.</span> <span class="n">If</span> + <span class="n">maxDFSigma</span> <span class="n">is</span> <span class="n">also</span> <span class="n">set</span><span class="p">,</span> <span class="n">it</span> <span class="n">will</span> <span class="n">override</span> + <span class="n">this</span> <span class="n">value</span><span class="p">.</span> + <span class="o">--</span><span class="n">weight</span> <span class="p">(</span><span class="o">-</span><span class="n">wt</span><span class="p">)</span> <span class="n">weight</span> <span class="n">The</span> <span class="n">kind</span> <span class="n">of</span> <span class="n">weight</span> <span class="n">to</span> <span class="n">use</span><span class="p">.</span> <span class="n">Currently</span> <span class="n">TF</span> + <span class="n">or</span> <span class="n">TFIDF</span><span class="p">.</span> <span class="n">Default</span><span class="p">:</span> <span class="n">TFIDF</span> + <span class="o">--</span><span class="n">norm</span> <span class="p">(</span><span class="o">-</span><span class="n">n</span><span class="p">)</span> <span class="n">norm</span> <span class="n">The</span> <span class="n">norm</span> <span class="n">to</span> <span class="n">use</span><span class="p">,</span> <span class="n">expressed</span> <span class="n">as</span> <span class="n">either</span> <span class="n">a</span> + <span class="n">float</span> <span 
class="n">or</span> "<span class="n">INF</span>" <span class="k">if</span> <span class="n">you</span> <span class="n">want</span> <span class="n">to</span> <span class="n">use</span> <span class="n">the</span> + <span class="n">Infinite</span> <span class="n">norm</span><span class="p">.</span> <span class="n">Must</span> <span class="n">be</span> <span class="n">greater</span> <span class="n">or</span> <span class="n">equal</span> + <span class="n">to</span> 0<span class="p">.</span> <span class="n">The</span> <span class="n">default</span> <span class="n">is</span> <span class="n">not</span> <span class="n">to</span> <span class="n">normalize</span> + <span class="o">--</span><span class="n">minLLR</span> <span class="p">(</span><span class="o">-</span><span class="n">ml</span><span class="p">)</span> <span class="n">minLLR</span> <span class="p">(</span><span class="n">Optional</span><span class="p">)</span><span class="n">The</span> <span class="n">minimum</span> <span class="n">Log</span> <span class="n">Likelihood</span> + <span class="n">Ratio</span><span class="p">(</span><span class="n">Float</span><span class="p">)</span> <span class="n">Default</span> <span class="n">is</span> 1<span class="p">.</span>0 + <span class="o">--</span><span class="n">numReducers</span> <span class="p">(</span><span class="o">-</span><span class="n">nr</span><span class="p">)</span> <span class="n">numReducers</span> <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">Number</span> <span class="n">of</span> <span class="n">reduce</span> <span class="n">tasks</span><span class="p">.</span> + <span class="n">Default</span> <span class="n">Value</span><span class="p">:</span> 1 + <span class="o">--</span><span class="n">maxNGramSize</span> <span class="p">(</span><span class="o">-</span><span class="n">ng</span><span class="p">)</span> <span class="n">ngramSize</span> <span class="p">(</span><span class="n">Optional</span><span 
class="p">)</span> <span class="n">The</span> <span class="n">maximum</span> <span class="nb">size</span> <span class="n">of</span> <span class="n">ngrams</span> <span class="n">to</span> + <span class="n">create</span> <span class="p">(</span>2 <span class="p">=</span> <span class="n">bigrams</span><span class="p">,</span> 3 <span class="p">=</span> <span class="n">trigrams</span><span class="p">,</span> <span class="n">etc</span><span class="p">)</span> + <span class="n">Default</span> <span class="n">Value</span><span class="p">:</span>1 + <span class="o">--</span><span class="n">overwrite</span> <span class="p">(</span><span class="o">-</span><span class="n">ow</span><span class="p">)</span> <span class="n">If</span> <span class="n">set</span><span class="p">,</span> <span class="n">overwrite</span> <span class="n">the</span> <span class="n">output</span> <span class="n">directory</span> + <span class="o">--</span><span class="n">help</span> <span class="p">(</span><span class="o">-</span><span class="n">h</span><span class="p">)</span> <span class="n">Print</span> <span class="n">out</span> <span class="n">help</span> + <span class="o">--</span><span class="n">sequentialAccessVector</span> <span class="p">(</span><span class="o">-</span><span class="n">seq</span><span class="p">)</span> <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">Whether</span> <span class="n">output</span> <span class="n">vectors</span> <span class="n">should</span> + <span class="n">be</span> <span class="n">SequentialAccessVectors</span><span class="p">.</span> <span class="n">Default</span> <span class="n">is</span> <span class="n">false</span><span class="p">;</span> + <span class="n">true</span> <span class="n">required</span> <span class="k">for</span> <span class="n">running</span> <span class="n">some</span> <span class="n">algorithms</span> + <span class="p">(</span><span class="n">LDA</span><span class="p">,</span><span 
class="n">Lanczos</span><span class="p">)</span> + <span class="o">--</span><span class="n">namedVector</span> <span class="p">(</span><span class="o">-</span><span class="n">nv</span><span class="p">)</span> <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">Whether</span> <span class="n">output</span> <span class="n">vectors</span> <span class="n">should</span> + <span class="n">be</span> <span class="n">NamedVectors</span><span class="p">.</span> <span class="n">If</span> <span class="n">set</span> <span class="n">true</span> <span class="k">else</span> <span class="n">false</span> + <span class="o">--</span><span class="n">logNormalize</span> <span class="p">(</span><span class="o">-</span><span class="n">lnorm</span><span class="p">)</span> <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">Whether</span> <span class="n">output</span> <span class="n">vectors</span> <span class="n">should</span> + <span class="n">be</span> <span class="n">logNormalize</span><span class="p">.</span> <span class="n">If</span> <span class="n">set</span> <span class="n">true</span> <span class="k">else</span> <span class="n">false</span> </pre></div> -<p>--minSupport is the min frequency for the word to be considered as a -feature. --minDF is the min number of documents the word needs to be in ---maxDFPercent is the max value of the expression (document frequency of a -word/total number of document) to be considered as good feature to be in -the document. This helps remove high frequency features like stop words</p> -<p><a name="CreatingVectorsfromText-Background"></a></p> -<h1 id="background">Background</h1> +<p>--minSupport is the min frequency for the word to be considered as a feature. 
--minDF is the minimum number of documents a word must appear in. --maxDFPercent is the maximum value, as a percentage, of the expression (document frequency of a word / total number of documents) for that word to be kept as a feature. Capping document frequency this way helps remove high-frequency features such as stop words. +<a name="CreatingVectorsfromText-Background"></a></p> +<h2 id="background">Background</h2> <ul> <li><a href="http://markmail.org/thread/l5zi3yk446goll3o">Discussion on centroid calculations with sparse vectors</a></li> </ul>