protwords.txt support in stemmers

2010-03-30 Thread Robert Muir
Hello Solr devs, One thing we did recently in lucene that I would like to expose in Solr, is add support for protected words to all stemmers. So the way this works is that a TokenStream attribute 'KeywordAttribute' is set, and all the stemfilters know to ignore tokens with this boolean value
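The mechanism described above (a marker step sets a per-token keyword flag, and stemmers skip flagged tokens) can be sketched roughly as follows. This is a language-neutral illustration in Python, not Lucene's Java API: `Token`, `mark_keywords`, `toy_stem`, and `analyze` are hypothetical names, and the "stemmer" is a toy that only strips a trailing "s".

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Set


@dataclass
class Token:
    text: str
    # Stands in for the boolean carried by a keyword attribute (assumption:
    # a plain field here, not Lucene's attribute machinery).
    is_keyword: bool = False


def mark_keywords(tokens: Iterable[Token], protected: Set[str]) -> Iterator[Token]:
    """Flag every token whose text is in the protected-word set."""
    for tok in tokens:
        if tok.text in protected:
            tok.is_keyword = True
        yield tok


def toy_stem(tokens: Iterable[Token]) -> Iterator[Token]:
    """Toy stemmer: strip a trailing 's', but leave flagged tokens untouched."""
    for tok in tokens:
        if not tok.is_keyword and len(tok.text) > 3 and tok.text.endswith("s"):
            tok.text = tok.text[:-1]
        yield tok


def analyze(words: List[str], protected: Set[str]) -> List[str]:
    """Chain the marker step in front of the stemmer, as a filter chain would."""
    tokens = (Token(w) for w in words)
    return [t.text for t in toy_stem(mark_keywords(tokens, protected))]
```

With `protected = {"windows"}`, calling `analyze(["cats", "windows"], protected)` stems "cats" but passes "windows" through unchanged, because the stemmer only consults the flag and never sees the word list itself.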

Re: protwords.txt support in stemmers

2010-03-30 Thread Yonik Seeley
On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir rcm...@gmail.com wrote: We have two choices: * we could treat this stuff as impl details, and add protwords.txt support to all stemming factories. we could just wrap the filter with a keywordmarkerfilter internally. * we could deprecate the

Re: protwords.txt support in stemmers

2010-03-30 Thread Robert Muir
On Tue, Mar 30, 2010 at 8:33 AM, Yonik Seeley yo...@lucidimagination.com wrote: It would also be nice to make the token categories generated by tokenizers into tags (like StandardTokenizer's ACRONYM, etc). A tokenizer that detected many of the properties could significantly speed up analysis

Re: protwords.txt support in stemmers

2010-03-30 Thread Robert Muir
On Tue, Mar 30, 2010 at 8:33 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir rcm...@gmail.com wrote: We have two choices: * we could treat this stuff as impl details, and add protwords.txt support to all stemming factories. we could just wrap

Re: protwords.txt support in stemmers

2010-03-30 Thread Yonik Seeley
On Tue, Mar 30, 2010 at 10:07 AM, Robert Muir rcm...@gmail.com wrote: Sorta unrelated too, but on the same topic of performance, I'd really like to improve the indexing speed with the example schema, and that's my hidden motivation here. I think we've already significantly improved WDF and

Re: protwords.txt support in stemmers

2010-03-30 Thread Robert Muir
On Tue, Mar 30, 2010 at 10:32 AM, Yonik Seeley yo...@lucidimagination.com wrote: Unfortunately not... it's normally something ad hoc like uploading a big CSV file, etc. There's also the very simplistic TestIndexingPerformance. ant test -Dtestcase=TestIndexingPerformance -Dargs=-server

Re: git at apache

2010-03-30 Thread David Smiley (@MITRE.org)
It absolutely is a better way to collaborate on development, especially in conjunction with github: http://github.com/apache/solr HOWEVER, the merge of Lucene and Solr has totally disrupted the git mirrors. Who can fix this? ~ David Smiley - Author:

Re: git at apache

2010-03-30 Thread Yonik Seeley
I've opened an issue for this: https://issues.apache.org/jira/browse/INFRA-2580 -Yonik http://www.lucidimagination.com On Tue, Mar 30, 2010 at 11:27 AM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: It absolutely is a better way to collaborate on development, especially in conjunction

[jira] Created: (SOLR-1855) Script to monitor Solr health including replication status

2010-03-30 Thread Shawn Smith (JIRA)
Script to monitor Solr health including replication status -- Key: SOLR-1855 URL: https://issues.apache.org/jira/browse/SOLR-1855 Project: Solr Issue Type: New Feature

[jira] Updated: (SOLR-1855) Script to monitor Solr health including replication status

2010-03-30 Thread Shawn Smith (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shawn Smith updated SOLR-1855: -- Attachment: checksolr I've attached a first pass implementation of this script: !checksolr!. It's

Re: protwords.txt support in stemmers

2010-03-30 Thread Grant Ingersoll
On Mar 30, 2010, at 8:33 AM, Yonik Seeley wrote: On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir rcm...@gmail.com wrote: We have two choices: * we could treat this stuff as impl details, and add protwords.txt support to all stemming factories. we could just wrap the filter with a

[jira] Issue Comment Edited: (SOLR-1855) Script to monitor Solr health including replication status

2010-03-30 Thread Shawn Smith (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851462#action_12851462 ] Shawn Smith edited comment on SOLR-1855 at 3/30/10 9:58 PM: I've

[jira] Commented: (SOLR-1375) BloomFilter on a field

2010-03-30 Thread Jason Rutherglen (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851637#action_12851637 ] Jason Rutherglen commented on SOLR-1375: {quote}Doesn't this hint at some of this

[jira] Created: (SOLR-1856) In Solr Cell, literals should override Tika-parsed values

2010-03-30 Thread Chris Harris (JIRA)
In Solr Cell, literals should override Tika-parsed values - Key: SOLR-1856 URL: https://issues.apache.org/jira/browse/SOLR-1856 Project: Solr Issue Type: Improvement

[jira] Updated: (SOLR-1856) In Solr Cell, literals should override Tika-parsed values

2010-03-30 Thread Chris Harris (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Harris updated SOLR-1856: --- Attachment: SOLR-1856.patch Initial patch. Notes: * We allow literal values to override all other

[jira] Commented: (SOLR-1633) Solr Cell should be smarter about literal and multiValued=false

2010-03-30 Thread Chris Harris (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851667#action_12851667 ] Chris Harris commented on SOLR-1633: bq. It seems like a possible improvement here would

[jira] Updated: (SOLR-1856) In Solr Cell, literals should override Tika-parsed values

2010-03-30 Thread Chris Harris (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Harris updated SOLR-1856: --- Description: I propose that ExtractingRequestHandler / SolrCell literals should take precedence over
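The precedence rule proposed in SOLR-1856 (a literal supplied for a field replaces whatever Tika extracted for that field) might look roughly like the sketch below. It is not the attached patch: `merge_fields` and the field names are hypothetical, and multi-valued fields are modelled simply as lists.

```python
from typing import Dict, List

FieldValues = Dict[str, List[str]]


def merge_fields(extracted: FieldValues, literals: FieldValues) -> FieldValues:
    """Return document fields where literal values win over extracted ones.

    Fields present only in `extracted` are kept as they are; a field present
    in `literals` replaces the extracted values for that field entirely
    (assumption: replace, not append).
    """
    merged = {name: list(values) for name, values in extracted.items()}
    for name, values in literals.items():
        merged[name] = list(values)
    return merged
```

For example, `merge_fields({"title": ["Parsed Title"], "author": ["A. Writer"]}, {"title": ["My Title"]})` keeps the extracted author but uses the literal title, so a single-valued field never ends up holding both values.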

[jira] Closed: (SOLR-1803) ExtractingRequestHandler does not propagate multiple values to a multi-valued field

2010-03-30 Thread Lance Norskog (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Norskog closed SOLR-1803. --- Resolution: Fixed 3 other issues go after this same problem - probably SOLR-1856 will win the turtle

[jira] Commented: (SOLR-1553) extended dismax query parser

2010-03-30 Thread Jonathan Rochkind (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851700#action_12851700 ] Jonathan Rochkind commented on SOLR-1553: - Hoss, I would be EXTREMELY interested in

[jira] Commented: (SOLR-1842) DataImportHandler ODBC keeps lock on the source table while optimisatising is being run...

2010-03-30 Thread Lance Norskog (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851722#action_12851722 ] Lance Norskog commented on SOLR-1842: - Could the DIH shut down all Datasources

[jira] Commented: (SOLR-1848) Add example Query page to the example

2010-03-30 Thread Lance Norskog (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851725#action_12851725 ] Lance Norskog commented on SOLR-1848: - Maybe there could be a Solr Apps project

[jira] Commented: (SOLR-1568) Implement Spatial Filter

2010-03-30 Thread Lance Norskog (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851726#action_12851726 ] Lance Norskog commented on SOLR-1568: - Dublin Core includes conventions for encoding