Hi David,

Just curious about your use of the HathiTrust list. I usually explain to 
people that it's customized to our index, and that they are probably better 
off making their own list based on lists of stop words appropriate for the 
languages in their index (sources are listed in the blog post 
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance). 
If you already have an index built and are re-indexing with CommonGrams, you 
can also use the -t flag with HighFreqTerms.java in Lucene contrib to determine 
the words that have the largest position lists and are therefore candidates to 
be added to your CommonGrams word list. We recently ran HighFreqTerms.java 
against our indexes and discovered that it would be better to remove some of 
the less frequent foreign-language stop words and instead use some very 
frequent words from the index.
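
For what it's worth, here is a minimal sketch of that approach in Python. It 
is an illustration only, not something we run at HathiTrust. It assumes you 
have your stop word lists as text (comma- or line-delimited) and a list of 
high-frequency terms you have already copied out of the HighFreqTerms output. 
The function names are made up for this example, and the output is one word 
per line, which is the format the SOLR-1844 description says the "words" file 
expects.

```python
def parse_words(text):
    """Split a word list on commas and line breaks.

    Blank entries are dropped. Lines starting with '#' are skipped as
    comments; that convention is an assumption of this sketch.
    """
    words = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        for token in line.split(","):
            token = token.strip()
            if token:
                words.append(token)
    return words


def build_common_words(stopword_texts, high_freq_terms, limit=None):
    """Merge high-frequency index terms with stop word lists.

    Index terms come first so they survive if `limit` truncates the list.
    Duplicates are dropped, keeping the first occurrence.
    """
    candidates = list(high_freq_terms)
    for text in stopword_texts:
        candidates.extend(parse_words(text))

    merged = []
    seen = set()
    for word in candidates:
        if word not in seen:
            seen.add(word)
            merged.append(word)
    return merged[:limit]


def to_words_file(words):
    """Render the list as one word per line."""
    return "\n".join(words) + "\n"
```

For example, build_common_words(["the,of"], ["la", "the"]) gives 
["la", "the", "of"]: the index terms come first and the duplicate is dropped.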

Tom Burton-West
www.hathitrust.org/blogs
________________________________________
From: Steven Rowe (JIRA) [j...@apache.org]
Sent: Monday, June 06, 2011 2:08 PM
To: dev@lucene.apache.org
Subject: [jira] [Resolved] (SOLR-1844) CommonGramsQueryFilterFactory should 
read words in a comma-delimited format

     [ https://issues.apache.org/jira/browse/SOLR-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe resolved SOLR-1844.
-------------------------------

    Resolution: Won't Fix
      Assignee: Steven Rowe

Thanks David.

> CommonGramsQueryFilterFactory should read words in a comma-delimited format
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-1844
>                 URL: https://issues.apache.org/jira/browse/SOLR-1844
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: David Smiley
>            Assignee: Steven Rowe
>            Priority: Minor
>
> CommonGramsQueryFilterFactory expects that the file(s) given to the "words" 
> argument is a carriage-return delimited list of words.  It doesn't support 
> comments either.  This file format should be more flexible to support comma 
> delimited values.  I came across this because I was trying to use the sample 
> file provided by HathiTrust:
> http://www.hathitrust.org/node/180    (named in a file new400common.txt)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

