Re: Stopwords work for Solr but not for Mahout

Grant Ingersoll Sat, 02 Jan 2010 13:51:07 -0800

I should note that I am still validating the quality of the results, and that 
the DIH stuff is just a sample of all the feeds I'm using.


On Jan 2, 2010, at 3:56 PM, Grant Ingersoll wrote:

> 
> On Jan 2, 2010, at 3:11 PM, Drew Farris wrote:
> 
>> I've managed to get k-means clustering working, but I agree it would be very
>> nice to have an end-to-end example that would allow others to get up to
>> speed quickly. I think the largest holes here are related to the vacuum of a
>> corpus of text into the Lucene index and the presentation of a
>> human-readable display of the results. It might be interesting to also
>> calculate and include some metrics such as the F-measure (in cases where we
>> have a reference categorization) and scatter score (in cases where we
>> don't).
>> 
>> The existing LDA example would be a useful starting point. It slurps
>> in the Reuters-21578
>> corpus <http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html>,
>> converts it to text, loads it into a Lucene index, extracts vectors from the
>> lucene index and runs LDA upon them.
>> 
>> This example uses the lucene benchmark utilities for the input to text
>> conversion and lucene loading. The benchmark utilities code is readable but
>> complex. It would be very nice to have a simple piece of code to handle the
>> creation of the Lucene index that others can easily build upon and adapt
>> to their existing corpus.
>> 
> 
> 
> +1.
> 
> I've also got this working for a bunch of RSS feeds using Solr's 
> DataImportHandler and the following commands:
> 
> In Solr, I set up the DataImportHandler with something like:
> <dataConfig>
> 
> <dataSource name="rss" type="HttpDataSource" encoding="UTF-8"/>
>       <document>
>   <!-- New York Times Sports feed -->
>               <entity name="nytSportsFeed"
>                               pk="link"
>                               url="http://feeds1.nytimes.com/nyt/rss/Sports"
>                               processor="XPathEntityProcessor"
>                               forEach="/rss/channel | /rss/channel/item"
>           dataSource="rss"
>       transformer="RegexTransformer,DateFormatTransformer">
>                       <field column="source" xpath="/rss/channel/title" 
> commonField="true" />
>                       <field column="source-link" xpath="/rss/channel/link" 
> commonField="true" />
>                       <field column="title" xpath="/rss/channel/item/title" />
>                       <field column="id" xpath="/rss/channel/item/guid" />
>                       <field column="link" xpath="/rss/channel/item/link" />
>     <!-- Use the RegexTransformer to strip out ads -->
>                       <field column="description" 
> xpath="/rss/channel/item/description" regex="&lt;a.*?&lt;/a&gt;" 
> replaceWith=""/>
>                       <field column="category" 
> xpath="/rss/channel/item/category" />
>     <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
>     <field column="pubDate" xpath="/rss/channel/item/pubDate" 
> dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
>   </entity>
>   <entity name="nytWorld"
>                               pk="link"
>                               url="http://feeds.nytimes.com/nyt/rss/World"
>                               processor="XPathEntityProcessor"
>                               forEach="/rss/channel | /rss/channel/item"
>           dataSource="rss"
>       transformer="RegexTransformer,DateFormatTransformer">
>                       <field column="source" xpath="/rss/channel/title" 
> commonField="true" />
>                       <field column="source-link" xpath="/rss/channel/link" 
> commonField="true" />
>                       <field column="title" xpath="/rss/channel/item/title" />
>                       <field column="id" xpath="/rss/channel/item/guid" />
>                       <field column="link" xpath="/rss/channel/item/link" />
>     <!-- Use the RegexTransformer to strip out ads -->
>                       <field column="description" 
> xpath="/rss/channel/item/description" regex="&lt;a.*?&lt;/a&gt;" 
> replaceWith=""/>
>                       <field column="category" 
> xpath="/rss/channel/item/category" />
>     <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
>     <field column="pubDate" xpath="/rss/channel/item/pubDate" 
> dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
>   </entity>
> 
>       </document>
> </dataConfig>
> 
> Then in my browser: 
> http://localhost:8983/solr/dataimport?command=full-import&clean=true
> 
> Then on the command line in Mahout home:
>> mvn dependency:copy-dependencies
>> cd target/dependency
>> java -cp "*" org.apache.mahout.utils.vectors.lucene.Driver --dir [path to 
>> index]/data/index/ --output ./solr-clust-n2/part-out.vec --field 
>> desc-clustering --idField id --dictOut ./solr-clust-n2/dictionary.txt --norm 
>> 2
>> java -Xmx1024M -cp "*" org.apache.mahout.clustering.kmeans.KMeansDriver 
>> --input ./solr-clust-n2/part-out.vec --clusters ./solr-clust-n2/out/clusters 
>>  --output ./solr-clust-n2/out/ --distance 
>> org.apache.mahout.common.distance.CosineDistanceMeasure --convergence 0.001 
>> --overwrite --k 25
>> java -Xmx1024M -cp "*" org.apache.mahout.utils.clustering.ClusterDumper 
>> --seqFileDir ./solr-clust-n2/out/clusters-2  --dictionary 
>> ./solr-clust-n2/dictionary.txt  --substring 100 --pointsDir 
>> ./solr-clust-n2/out/points/
> or:
>> java -Xmx1024M -cp "*" org.apache.mahout.utils.vectors.lucene.ClusterLabels 
>> --dir [path to index]/data/index/ --field description --idField id 
>> --seqFileDir ./solr-clust-n2/out/clusters-2  --pointsDir 
>> ./solr-clust-n2/out/points/ --minClusterSize 5 --maxLabels 10
> 
> 
> -Grant
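
For the F-measure Drew mentions above (the case where a reference 
categorization exists), here is a minimal sketch in Python. It is not part 
of Mahout; the function name and the input format (two parallel lists, one 
of reference labels and one of assigned cluster ids) are assumptions made 
for illustration. It uses the common class-weighted definition: for each 
reference class, take the best F1 score over all clusters, then weight by 
class size.

```python
from collections import Counter


def clustering_f_measure(reference, assigned):
    """Class-weighted F-measure of a clustering against reference labels.

    reference: list of reference category labels, one per document
    assigned:  list of cluster ids, one per document, in the same order
    (Hypothetical helper for illustration; not a Mahout API.)
    """
    if len(reference) != len(assigned):
        raise ValueError("reference and assigned must have the same length")
    n = len(reference)
    if n == 0:
        return 0.0

    class_sizes = Counter(reference)            # n_i: documents per class
    cluster_sizes = Counter(assigned)           # n_j: documents per cluster
    overlap = Counter(zip(reference, assigned)) # n_ij: class i in cluster j

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for cluster, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, cluster)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            f1 = 2 * precision * recall / (precision + recall)
            best = max(best, f1)
        # Weight each class's best match by its share of the corpus.
        total += (n_i / n) * best
    return total
```

For the RSS setup, the "source" field could stand in for the reference 
label, with the cluster ids taken from the k-means points output; that 
mapping is an assumption, not something I have verified against the data.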

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
