Re: GIT does not support empty directories

2010-04-16 Thread Robert Muir
Seriously? We should hack our ant files around the bugs in every crappy
source control system that comes out?

Fix Git.

On Thu, Apr 15, 2010 at 10:55 PM, Smiley, David W. dsmi...@mitre.org wrote:

 I've run into this too.  I don't think this needs to be documented, I think
 it needs to be *fixed* -- that is, the relevant ant tasks need to not assume
 these directories exist and create them if not.

 ~ David Smiley

 -Original Message-
 From: Lance Norskog [mailto:goks...@gmail.com]
 Sent: Wednesday, April 14, 2010 11:14 PM
 To: solr-dev
 Subject: GIT does not support empty directories

 There are some empty directories in the Solr source tree, both in 1.4
 and the trunk.

 example/work
 example/webapp
 example/logs

 Git does not support empty directories:

 https://git.wiki.kernel.org/index.php/GitFaq#Can_I_add_empty_directories.3F

 And so, when you check out from the Apache GIT repository, these empty
 directories do not appear and 'ant example' and 'ant run-example'
 fail. There is no 'how to use the solr git stuff' wiki page; that
 seems like the right place to document this. I'm not git-smart enough
 to write that page.
 --
 Lance Norskog
 goks...@gmail.com




-- 
Robert Muir
rcm...@gmail.com
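A sketch of the workaround David proposes (hypothetical, not actual Solr build code): each directory the example tasks expect gets an idempotent create-if-missing step, the equivalent of Ant's `<mkdir>` task, so a fresh git checkout without the empty directories still works.

```java
import java.io.File;

// Hypothetical sketch, not actual Solr build code: create the directories
// the example tasks expect if a (git) checkout is missing them.
public class EnsureDirs {
    public static void main(String[] args) {
        // Directory names taken from Lance's message above.
        String[] dirs = {"example/work", "example/webapp", "example/logs"};
        for (String d : dirs) {
            // mkdirs() creates missing parents and is a no-op when the
            // directory already exists, so it is safe on every checkout.
            new File(d).mkdirs();
        }
    }
}
```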


Re: GIT does not support empty directories

2010-04-16 Thread Robert Muir
I don't like the idea of complicating Lucene/Solr's build system any more
than it already is, unless it's absolutely necessary; it's already too
complicated.

Instead of adding more hacks, what is actually broken (git) is what should
be fixed, as the link states:

Currently the design of the git index (staging area) only permits *files* to
be listed, and nobody competent enough to make the change to allow empty
directories has cared enough about this situation to remedy it.

On Fri, Apr 16, 2010 at 11:14 AM, Smiley, David W. dsmi...@mitre.org wrote:

 Seriously.
 I sympathize with your point that git should support empty directories.
  But as a practical matter, it's easy to make the ant build tolerant of
 them.

 ~ David Smiley
 
 From: Robert Muir [rcm...@gmail.com]
 Sent: Friday, April 16, 2010 6:53 AM
 To: solr-dev@lucene.apache.org
 Subject: Re: GIT does not support empty directories

 Seriously? We should hack our ant files around the bugs in every crappy
 source control system that comes out?

 Fix Git.

 On Thu, Apr 15, 2010 at 10:55 PM, Smiley, David W. dsmi...@mitre.org
 wrote:

  I've run into this too.  I don't think this needs to be documented, I
 think
  it needs to be *fixed* -- that is, the relevant ant tasks need to not
 assume
  these directories exist and create them if not.
 
  ~ David Smiley
 
  -Original Message-
  From: Lance Norskog [mailto:goks...@gmail.com]
  Sent: Wednesday, April 14, 2010 11:14 PM
  To: solr-dev
  Subject: GIT does not support empty directories
 
  There are some empty directories in the Solr source tree, both in 1.4
  and the trunk.
 
  example/work
  example/webapp
  example/logs
 
  Git does not support empty directories:
 
 
 https://git.wiki.kernel.org/index.php/GitFaq#Can_I_add_empty_directories.3F
 
  And so, when you check out from the Apache GIT repository, these empty
  directories do not appear and 'ant example' and 'ant run-example'
  fail. There is no 'how to use the solr git stuff' wiki page; that
  seems like the right place to document this. I'm not git-smart enough
  to write that page.
  --
  Lance Norskog
  goks...@gmail.com
 



 --
 Robert Muir
 rcm...@gmail.com




-- 
Robert Muir
rcm...@gmail.com


Re: Eclipse project files...

2010-04-13 Thread Robert Muir
On Mon, Apr 12, 2010 at 5:15 AM, Paolo Castagna 
castagna.li...@googlemail.com wrote:


 For Lucene, I needed two more jars from Ant project:

  - ant-1.7.1.jar
  - ant-junit-1.7.1.jar


Paolo, I put these in the lib directory now, to hopefully make IDE
configuration easier.

By the way, thanks for your ideas here. I think it's worth our time to try to
make Lucene/Solr as easy as possible for someone to bring up in their IDE,
or we scare people away...


-- 
Robert Muir
rcm...@gmail.com


[jira] Assigned: (SOLR-1876) Convert all tokenstreams and tests to use CharTermAttribute

2010-04-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned SOLR-1876:
-

Assignee: Robert Muir

 Convert all tokenstreams and tests to use CharTermAttribute
 ---

 Key: SOLR-1876
 URL: https://issues.apache.org/jira/browse/SOLR-1876
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: SOLR-1876.patch


 See the improvements in LUCENE-2302.
 TermAttribute has been deprecated for flexible indexing, as terms can really 
 be anything, as long as they can
 be serialized to byte[]. 
 For character-terms, a CharTermAttribute has been created, with a more 
 friendly API. Additionally this attribute
 implements the CharSequence and Appendable interfaces.
 We should convert all Solr tokenstreams to use this new attribute.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Assigned: (SOLR-1874) optimize patternreplacefilter

2010-04-10 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned SOLR-1874:
-

Assignee: Robert Muir

 optimize patternreplacefilter
 -

 Key: SOLR-1874
 URL: https://issues.apache.org/jira/browse/SOLR-1874
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: SOLR-1874.patch


 We can optimize PatternReplaceFilter:
 * don't need to create Strings since CharTermAttribute implements 
 CharSequence, just match directly against it.
 * reuse the matcher, since CharTermAttribute is reused, too.
 * don't create Strings/waste time in replaceAll/replaceFirst if the term 
 doesn't match the regex at all... check with find() first.
 There is more that could be done to make it faster for terms that do match, 
 but this is simple and a start.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Resolved: (SOLR-1874) optimize patternreplacefilter

2010-04-10 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved SOLR-1874.
---

Resolution: Fixed

Committed revision 932752.

 optimize patternreplacefilter
 -

 Key: SOLR-1874
 URL: https://issues.apache.org/jira/browse/SOLR-1874
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: SOLR-1874.patch


 We can optimize PatternReplaceFilter:
 * don't need to create Strings since CharTermAttribute implements 
 CharSequence, just match directly against it.
 * reuse the matcher, since CharTermAttribute is reused, too.
 * don't create Strings/waste time in replaceAll/replaceFirst if the term 
 doesn't match the regex at all... check with find() first.
 There is more that could be done to make it faster for terms that do match, 
 but this is simple and a start.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (SOLR-1876) Convert all tokenstreams and tests to use CharTermAttribute

2010-04-10 Thread Robert Muir (JIRA)
Convert all tokenstreams and tests to use CharTermAttribute
---

 Key: SOLR-1876
 URL: https://issues.apache.org/jira/browse/SOLR-1876
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
 Fix For: 3.1


See the improvements in LUCENE-2302.

TermAttribute has been deprecated for flexible indexing, as terms can really be 
anything, as long as they can
be serialized to byte[]. 

For character-terms, a CharTermAttribute has been created, with a more friendly 
API. Additionally this attribute
implements the CharSequence and Appendable interfaces.

We should convert all Solr tokenstreams to use this new attribute.
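The friendliness of the new API can be illustrated with plain JDK types (an analogy only, not Lucene code): StringBuilder, like CharTermAttribute, implements both CharSequence and Appendable, so term text can be built and inspected in place without creating Strings.

```java
// JDK-only analogy, not Lucene code: StringBuilder, like the new
// CharTermAttribute, implements both CharSequence and Appendable.
public class CharTermSketch {
    public static void main(String[] args) {
        StringBuilder term = new StringBuilder(); // stands in for a reused CharTermAttribute
        term.setLength(0);                        // analogous to CharTermAttribute.setEmpty()
        term.append("char").append("term");       // Appendable: build the term in place
        // CharSequence: consumers can inspect it without calling toString()
        System.out.println(term.length() + ":" + term.charAt(0));
    }
}
```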

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (SOLR-1876) Convert all tokenstreams and tests to use CharTermAttribute

2010-04-10 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1876:
--

Attachment: SOLR-1876.patch

This patch does the following:
* Converts all tokenstreams to use CharTermAttribute
* Makes all non-final concrete TokenStreams and Analyzers final (see 
LUCENE-2389)
* enables both lucene and solr assertions when running solr core and contrib 
tests (previously disabled!)

All tests pass, and also pass with the additional assertions if you apply 
LUCENE-2389

 Convert all tokenstreams and tests to use CharTermAttribute
 ---

 Key: SOLR-1876
 URL: https://issues.apache.org/jira/browse/SOLR-1876
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
 Fix For: 3.1

 Attachments: SOLR-1876.patch


 See the improvements in LUCENE-2302.
 TermAttribute has been deprecated for flexible indexing, as terms can really 
 be anything, as long as they can
 be serialized to byte[]. 
 For character-terms, a CharTermAttribute has been created, with a more 
 friendly API. Additionally this attribute
 implements the CharSequence and Appendable interfaces.
 We should convert all Solr tokenstreams to use this new attribute.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (SOLR-1874) optimize patternreplacefilter

2010-04-09 Thread Robert Muir (JIRA)
optimize patternreplacefilter
-

 Key: SOLR-1874
 URL: https://issues.apache.org/jira/browse/SOLR-1874
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
 Fix For: 3.1
 Attachments: SOLR-1874.patch

We can optimize PatternReplaceFilter:
* don't need to create Strings since CharTermAttribute implements CharSequence, 
just match directly against it.
* reuse the matcher, since CharTermAttribute is reused, too.
* don't create Strings/waste time in replaceAll/replaceFirst if the term 
doesn't match the regex at all... check with find() first.

There is more that could be done to make it faster for terms that do match, but 
this is simple and a start.
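The three bullet points can be sketched with plain JDK regex classes (illustrative names only, not the actual PatternReplaceFilter code; a StringBuilder stands in for the reused CharTermAttribute, since both implement CharSequence):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch, not the real PatternReplaceFilter.
public class PatternReplaceSketch {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("cat");
        Matcher matcher = pattern.matcher("");     // created once, reused for every token
        StringBuilder term = new StringBuilder("concatenate"); // stands in for the term attribute
        matcher.reset(term);                       // match the CharSequence directly: no new String
        if (matcher.find()) {                      // cheap check first; skip replaceAll() on non-matches
            // replaceAll() internally resets the matcher before scanning
            System.out.println(matcher.replaceAll("dog"));
        }
    }
}
```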

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1874) optimize patternreplacefilter

2010-04-09 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1874:
--

Attachment: SOLR-1874.patch

 optimize patternreplacefilter
 -

 Key: SOLR-1874
 URL: https://issues.apache.org/jira/browse/SOLR-1874
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
 Fix For: 3.1

 Attachments: SOLR-1874.patch


 We can optimize PatternReplaceFilter:
 * don't need to create Strings since CharTermAttribute implements 
 CharSequence, just match directly against it.
 * reuse the matcher, since CharTermAttribute is reused, too.
 * don't create Strings/waste time in replaceAll/replaceFirst if the term 
 doesn't match the regex at all... check with find() first.
 There is more that could be done to make it faster for terms that do match, 
 but this is simple and a start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1869) RemoveDuplicatesTokenFilter doest have expected behaviour

2010-04-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854983#action_12854983
 ] 

Robert Muir commented on SOLR-1869:
---

bq. this all started because the highlighter was highlighting a term at the 
same offsets twice,

Perhaps we should fix this directly in DefaultSolrHighlighter? It already has 
this TokenStream-sorting filter that's intended to do the following:
{code}
/** Orders Tokens in a window first by their startOffset ascending.
 * endOffset is currently ignored.
 * This is meant to work around fickleness in the highlighter only.  It
 * can mess up token positions and should not be used for indexing or querying.
 */
{code}

Maybe the deduplication logic should occur here after it sorts on startOffset? 
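That suggestion can be sketched with plain JDK collections (hypothetical, not DefaultSolrHighlighter code, and the token data is invented): after sorting by startOffset, drop any token whose (startOffset, endOffset, text) triple repeats.

```java
import java.util.*;

// Hypothetical sketch of dedup-after-sort; each token is {start, end, text}.
public class OffsetDedupSketch {
    public static void main(String[] args) {
        List<String[]> tokens = new ArrayList<>(Arrays.asList(
            new String[]{"5", "7", "ca"},
            new String[]{"0", "7", "identi.ca"},
            new String[]{"5", "7", "ca"}));     // duplicate at the same offsets
        tokens.sort(Comparator.comparingInt((String[] t) -> Integer.parseInt(t[0])));
        Set<String> seen = new HashSet<>();
        // removeIf drops a token when its offsets+text were already seen
        tokens.removeIf(t -> !seen.add(t[0] + ":" + t[1] + ":" + t[2]));
        System.out.println(tokens.size());
    }
}
```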


 RemoveDuplicatesTokenFilter doest have expected behaviour
 -

 Key: SOLR-1869
 URL: https://issues.apache.org/jira/browse/SOLR-1869
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Reporter: Joe Calderon
Priority: Minor
 Attachments: RemoveDupOffsetTokenFilter.java, 
 RemoveDupOffsetTokenFilterFactory.java, SOLR-1869.patch


 The RemoveDuplicatesTokenFilter seems broken, as it initializes its map and 
 attributes at the class level and not within its constructor.
 In addition, I would think the expected behaviour would be to remove identical 
 terms with the same offset positions; instead it looks like it removes 
 duplicates based on position increment, which won't work when using it after 
 something like the edgengram filter. When I posted this to the mailing list, 
 even Erik Hatcher seemed to think that's what this filter was supposed to do.
 Attaching a patch that has the expected behaviour and initializes variables 
 in the constructor.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1869) RemoveDuplicatesTokenFilter doest have expected behaviour

2010-04-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854676#action_12854676
 ] 

Robert Muir commented on SOLR-1869:
---

Joe, the initialization is the same. I simply prefer to do this right where the 
attribute is declared, rather than doing it in the ctor (it's the same in 
Java!). So this is no problem.

As far as the behavior goes, the filter is currently correct:
{noformat}
A TokenFilter which filters out Tokens at the same position and Term text as 
the previous token in the stream.
{noformat}

If you instead want to create a filter that removes duplicates across an entire 
field, that is really a completely different filter; it does sound like a 
useful one, though!

Can you instead create a patch for a separate filter with a different name?

I think you can start with this patch, but it has a number of issues:
* the map/set is never cleared, so it won't work across reusable tokenstreams. 
The map/set should be cleared in reset()
* I would use CharArraySet instead of this map, like the current 
RemoveDuplicatesTokenFilter


 RemoveDuplicatesTokenFilter doest have expected behaviour
 -

 Key: SOLR-1869
 URL: https://issues.apache.org/jira/browse/SOLR-1869
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis
Reporter: Joe Calderon
Priority: Minor
 Attachments: SOLR-1869.patch


 The RemoveDuplicatesTokenFilter seems broken, as it initializes its map and 
 attributes at the class level and not within its constructor.
 In addition, I would think the expected behaviour would be to remove identical 
 terms with the same offset positions; instead it looks like it removes 
 duplicates based on position increment, which won't work when using it after 
 something like the edgengram filter. When I posted this to the mailing list, 
 even Erik Hatcher seemed to think that's what this filter was supposed to do.
 Attaching a patch that has the expected behaviour and initializes variables 
 in the constructor.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1869) RemoveDuplicatesTokenFilter doest have expected behaviour

2010-04-07 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1869:
--

Issue Type: New Feature  (was: Bug)

 RemoveDuplicatesTokenFilter doest have expected behaviour
 -

 Key: SOLR-1869
 URL: https://issues.apache.org/jira/browse/SOLR-1869
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Reporter: Joe Calderon
Priority: Minor
 Attachments: SOLR-1869.patch


 The RemoveDuplicatesTokenFilter seems broken, as it initializes its map and 
 attributes at the class level and not within its constructor.
 In addition, I would think the expected behaviour would be to remove identical 
 terms with the same offset positions; instead it looks like it removes 
 duplicates based on position increment, which won't work when using it after 
 something like the edgengram filter. When I posted this to the mailing list, 
 even Erik Hatcher seemed to think that's what this filter was supposed to do.
 Attaching a patch that has the expected behaviour and initializes variables 
 in the constructor.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1869) RemoveDuplicatesTokenFilter doest have expected behaviour

2010-04-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854712#action_12854712
 ] 

Robert Muir commented on SOLR-1869:
---

bq. The CharArrayMap is more performant in lookup, but you are right, we may 
need posincr.

We don't need it for the current implementation, as we clear() the CharArraySet 
when we encounter a term with posincr > 0, so the set only holds the terms seen 
at the current position.
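The clearing rule described here can be sketched with plain JDK collections (a HashSet stands in for Lucene's CharArraySet; the token data is made up). Tokens are (term, positionIncrement) pairs, so "wifi" and the second "wi" share a position with the first "wi":

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// JDK-only sketch of the dedup logic, not the real filter code.
public class RemoveDupSketch {
    public static void main(String[] args) {
        String[][] tokens = {{"wi", "1"}, {"wifi", "0"}, {"wi", "0"}, {"fi", "1"}};
        Set<String> seenAtPosition = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String[] tok : tokens) {
            if (Integer.parseInt(tok[1]) > 0) {
                seenAtPosition.clear();       // new position: forget the previous position's terms
            }
            if (seenAtPosition.add(tok[0])) { // add() returns false for a duplicate at this position
                kept.add(tok[0]);
            }
        }
        System.out.println(kept);             // the duplicate "wi" at the same position is dropped
    }
}
```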

 RemoveDuplicatesTokenFilter doest have expected behaviour
 -

 Key: SOLR-1869
 URL: https://issues.apache.org/jira/browse/SOLR-1869
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Reporter: Joe Calderon
Priority: Minor
 Attachments: SOLR-1869.patch


 The RemoveDuplicatesTokenFilter seems broken, as it initializes its map and 
 attributes at the class level and not within its constructor.
 In addition, I would think the expected behaviour would be to remove identical 
 terms with the same offset positions; instead it looks like it removes 
 duplicates based on position increment, which won't work when using it after 
 something like the edgengram filter. When I posted this to the mailing list, 
 even Erik Hatcher seemed to think that's what this filter was supposed to do.
 Attaching a patch that has the expected behaviour and initializes variables 
 in the constructor.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1865) ignore byte-order markers in SolrResourceLoader

2010-04-05 Thread Robert Muir (JIRA)
ignore byte-order markers in SolrResourceLoader
---

 Key: SOLR-1865
 URL: https://issues.apache.org/jira/browse/SOLR-1865
 Project: Solr
  Issue Type: Improvement
Reporter: Robert Muir
Priority: Minor
 Fix For: 3.1
 Attachments: SOLR-1865.patch

If you create, say, a stopwords list with Windows Notepad or other editors and 
save it as UTF-8, some of these editors will insert a byte-order mark 
(zero-width no-break space) as the first character of the file.

http://www.lucidimagination.com/search/document/5101871231fc95af/is_this_a_bug_of_the_ressourceloader
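The fix can be sketched in a few lines of plain Java (illustrative only; the actual patch changes SolrResourceLoader.getLines(), as noted in the follow-up message): strip a leading U+FEFF from the first line of a UTF-8 file.

```java
// Illustrative sketch of ignoring a UTF-8 byte-order mark, not the real patch.
public class StripBom {
    public static void main(String[] args) {
        String firstLine = "\uFEFFstopword";  // first line of a BOM-prefixed stopwords file
        if (!firstLine.isEmpty() && firstLine.charAt(0) == '\uFEFF') {
            firstLine = firstLine.substring(1);  // drop the zero-width no-break space
        }
        System.out.println(firstLine);
    }
}
```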


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1865) ignore byte-order markers in SolrResourceLoader

2010-04-05 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1865:
--

Attachment: SOLR-1865.patch

Attached is a patch to ignore BOMs at the beginning of files loaded with 
getLines().


 ignore byte-order markers in SolrResourceLoader
 ---

 Key: SOLR-1865
 URL: https://issues.apache.org/jira/browse/SOLR-1865
 Project: Solr
  Issue Type: Improvement
Reporter: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: SOLR-1865.patch


 If you create, say, a stopwords list with Windows Notepad or other editors and 
 save it as UTF-8, some of these editors will insert a byte-order mark 
 (zero-width no-break space) as the first character of the file.
 http://www.lucidimagination.com/search/document/5101871231fc95af/is_this_a_bug_of_the_ressourceloader

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1860) improve stopwords list handling

2010-04-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853684#action_12853684
 ] 

Robert Muir commented on SOLR-1860:
---

bq. Either we can setup a simple export and conversion to the format Solr 
currently supports now, and if/when someon updates StopFilterFactory to support 
the new format, then we can stop converting when we export

Well, this isn't that big of a deal either way. 

In Lucene we have a helper class called WordListLoader that supports loading 
this format from an InputStream.

One idea to consider: we could try merging some of what SolrResourceLoader does 
with this WordListLoader, then its all tested and in one place. 
it appears there might be some duplication of effort here... e.g. how long till 
a lucene user complains about UTF-8 bom markers in their stoplists :)

We can still use ant to keep the files in sync automatically from the lucene 
copies.


 improve stopwords list handling
 ---

 Key: SOLR-1860
 URL: https://issues.apache.org/jira/browse/SOLR-1860
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor

 Currently Solr makes it easy to use english stopwords for StopFilter or 
 CommonGramsFilter.
 Recently in lucene, we added stopwords lists (mostly, but not all from 
 snowball) to all the language analyzers.
 So it would be nice if a user can easily specify that they want to use a 
 french stopword list, and use it for StopFilter or CommonGrams.
 The ones from Snowball, however, are formatted differently from the others 
 (although in Lucene we have parsers to deal with this).
 Additionally, we abstract this from Lucene users by adding a static 
 getDefaultStopSet to all analyzers.
 There are two approaches, the first one I think I prefer the most, but I'm 
 not sure it matters as long as we have good examples (maybe a foreign 
 language example schema?)
 1. The user would specify something like:
  <filter class="solr.StopFilterFactory" fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../>
  This would just grab the CharArraySet from the FrenchAnalyzer's 
 getDefaultStopSet method, who cares where it comes from or how its loaded.
 2. We add support for snowball-formatted stopwords lists, and the user could 
 specify something like:
 <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball" ... />
 The disadvantage to this is they have to know where the list is, what format 
 its in, etc. For example: snowball doesn't provide Romanian or Turkish
 stopword lists to go along with their stemmers, so we had to add our own.
 Let me know what you guys think, and I will create a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries

2010-04-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852811#action_12852811
 ] 

Robert Muir commented on SOLR-1852:
---

Committed the test to trunk: revision 930262.

 enablePositionIncrements=true can cause searches to fail when they are 
 parsed as phrase queries
 -

 Key: SOLR-1852
 URL: https://issues.apache.org/jira/browse/SOLR-1852
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Peter Wolanin
Assignee: Robert Muir
 Attachments: SOLR-1852.patch, SOLR-1852_testcase.patch


 Symptom: searching for a string like a domain name containing a '.', the Solr 
 1.4 analyzer tells me that I will get a match, but when I enter the search 
 either in the client or directly in Solr, the search fails. 
 test string:  Identi.ca
 queries that fail:  IdentiCa, Identi.ca, Identi-ca
 query that matches: Identi ca
 schema in use is:
 http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34content-type=text%2Fplainview=copathrev=DRUPAL-6--1
 Screen shots:
 analysis:  http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png
 dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png
 dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png
 standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png
 Whether or not the bug appears is determined by the surrounding text:
 would be great to have support for Identi.ca on the follow block
 fails to match Identi.ca, but putting the content on its own or in another 
 sentence:
 Support Identi.ca
 the search matches.  Testing suggests the word 'for' is the problem, and it 
 looks like the bug occurs when a stop word precedes a word that is split up 
 using the word delimiter filter.
 Setting enablePositionIncrements=false in the stop filter and reindexing 
 causes the searches to match.
 According to Mark Miller in #solr, this bug appears to be fixed already in 
 Solr trunk, either due to the upgraded lucene or changes to the 
 WordDelimiterFactory

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1860) improve stopwords list handling

2010-04-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852978#action_12852978
 ] 

Robert Muir commented on SOLR-1860:
---

A third idea from Hoss Man:

We should make it easy to edit these lists, as with the English one.
So an idea is to create an intl/ folder (or similar) under the example with 
stopwords_fr.txt, stopwords_de.txt, etc.
Additionally we could have a schema-intl.xml with example types 'text_fr', 
'text_de', etc. set up for various languages.
I like this idea best.


 improve stopwords list handling
 ---

 Key: SOLR-1860
 URL: https://issues.apache.org/jira/browse/SOLR-1860
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor

 Currently Solr makes it easy to use english stopwords for StopFilter or 
 CommonGramsFilter.
 Recently in lucene, we added stopwords lists (mostly, but not all from 
 snowball) to all the language analyzers.
 So it would be nice if a user can easily specify that they want to use a 
 french stopword list, and use it for StopFilter or CommonGrams.
 The ones from Snowball, however, are formatted differently from the others 
 (although in Lucene we have parsers to deal with this).
 Additionally, we abstract this from Lucene users by adding a static 
 getDefaultStopSet to all analyzers.
 There are two approaches, the first one I think I prefer the most, but I'm 
 not sure it matters as long as we have good examples (maybe a foreign 
 language example schema?)
 1. The user would specify something like:
  <filter class="solr.StopFilterFactory" fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../>
  This would just grab the CharArraySet from the FrenchAnalyzer's 
 getDefaultStopSet method, who cares where it comes from or how its loaded.
 2. We add support for snowball-formatted stopwords lists, and the user could 
 specify something like:
 <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball" ... />
 The disadvantage to this is they have to know where the list is, what format 
 its in, etc. For example: snowball doesn't provide Romanian or Turkish
 stopword lists to go along with their stemmers, so we had to add our own.
 Let me know what you guys think, and I will create a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1859) speed up indexing for example schema

2010-04-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852375#action_12852375
 ] 

Robert Muir commented on SOLR-1859:
---

Any objections? If not I would like to commit later today.

Thanks!

 speed up indexing for example schema
 

 Key: SOLR-1859
 URL: https://issues.apache.org/jira/browse/SOLR-1859
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: SOLR-1859.patch


 The example schema should use the lucene core PorterStemmer (coded in Java by 
 Martin Porter)
  instead of the Snowball one that is auto-generated code.
 Although we have sped up the Snowball stemmer, its still pretty slow and the 
 example should be fast.
 Below is the output of ant test -Dtestcase=TestIndexingPerformance 
 -Dargs=-server -Diter=10
 These results are consistent with large document indexing times that I have 
 seen on large english
 collections with Lucene, we double indexing speed.
 {noformat}
 solr1.5branch:
 iter=10 time=5841 throughput=17120
 iter=10 time=5839 throughput=17126
 iter=10 time=6017 throughput=16619
 trunk (unpatched):
 iter=10 time=4132 throughput=24201
 iter=10 time=4142 throughput=24142
 iter=10 time=4151 throughput=24090
 trunk (patched)
 iter=10 time=2998 throughput=33355
 iter=10 time=3021 throughput=33101
 iter=10 time=3006 throughput=33266
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-1859) speed up indexing for example schema

2010-04-01 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved SOLR-1859.
---

Resolution: Fixed

Committed revision 930050.

 speed up indexing for example schema
 

 Key: SOLR-1859
 URL: https://issues.apache.org/jira/browse/SOLR-1859
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: SOLR-1859.patch


 The example schema should use the lucene core PorterStemmer (coded in Java by 
 Martin Porter)
  instead of the Snowball one that is auto-generated code.
 Although we have sped up the Snowball stemmer, it's still pretty slow and the 
 example should be fast.
 Below is the output of ant test -Dtestcase=TestIndexingPerformance 
 -Dargs=-server -Diter=10
 These results are consistent with large document indexing times that I have 
 seen on large English collections with Lucene: we double indexing speed.
 {noformat}
 solr1.5branch:
 iter=10 time=5841 throughput=17120
 iter=10 time=5839 throughput=17126
 iter=10 time=6017 throughput=16619
 trunk (unpatched):
 iter=10 time=4132 throughput=24201
 iter=10 time=4142 throughput=24142
 iter=10 time=4151 throughput=24090
 trunk (patched)
 iter=10 time=2998 throughput=33355
 iter=10 time=3021 throughput=33101
 iter=10 time=3006 throughput=33266
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1860) improve stopwords list handling

2010-04-01 Thread Robert Muir (JIRA)
improve stopwords list handling
---

 Key: SOLR-1860
 URL: https://issues.apache.org/jira/browse/SOLR-1860
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor


Currently Solr makes it easy to use english stopwords for StopFilter or 
CommonGramsFilter.
Recently in Lucene, we added stopword lists (mostly, but not all, from 
Snowball) to all the language analyzers.

So it would be nice if a user can easily specify that they want to use a french 
stopword list, and use it for StopFilter or CommonGrams.

The ones from Snowball, however, are formatted differently than the 
others (although in Lucene we have parsers to deal with this).
Additionally, we abstract this from Lucene users by adding a static 
getDefaultStopSet method to all analyzers.

There are two approaches; I think I prefer the first, but I'm not sure it 
matters as long as we have good examples (maybe a foreign-language example 
schema?).

1. The user would specify something like:

<filter class="solr.StopFilterFactory" 
fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../>
This would just grab the CharArraySet from the FrenchAnalyzer's 
getDefaultStopSet method; who cares where it comes from or how it's loaded.

2. We add support for Snowball-formatted stopword lists, and the user could 
do something like:

<filter class="solr.StopFilterFactory" 
words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball" 
... />
The disadvantage to this is they have to know where the list is, what format 
it's in, etc. For example: Snowball doesn't provide Romanian or Turkish
stopword lists to go along with their stemmers, so we had to add our own.

Let me know what you guys think, and I will create a patch.
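To make the two options concrete, here is a sketch of what each could look like in a schema.xml analyzer chain (note: the fromAnalyzer and format attributes are proposals in this issue, not existing Solr configuration, and the field type name is hypothetical):

{noformat}
<fieldType name="text_fr" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Option 1 (proposed): grab the CharArraySet from the analyzer's
         static getDefaultStopSet(), wherever and however it is stored -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer"/>
    <!-- Option 2 (proposed): point at a bundled Snowball-formatted list
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="org/apache/lucene/analysis/snowball/french_stop.txt"
            format="snowball"/>
    -->
  </analyzer>
</fieldType>
{noformat}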



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (SOLR-1740) ShingleFilterFactory improvements

2010-04-01 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned SOLR-1740:
-

Assignee: Robert Muir

 ShingleFilterFactory improvements
 -

 Key: SOLR-1740
 URL: https://issues.apache.org/jira/browse/SOLR-1740
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 1.5
Reporter: Steven Rowe
Assignee: Robert Muir
Priority: Minor
 Attachments: SOLR-1740.patch


 ShingleFilterFactory should allow specification of minimum shingle size (in 
 addition to maximum shingle size), as well as the separator to use between 
 tokens.  These are implemented at LUCENE-2218.  The attached patch allows 
 ShingleFilterFactory to accept configuration of these items, and includes 
 tests against the new functionality in TestShingleFilterFactory.  
 Solr will have to upgrade to lucene-analyzers-3.1-dev.jar before the attached 
 patch will apply.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1740) ShingleFilterFactory improvements

2010-04-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852686#action_12852686
 ] 

Robert Muir commented on SOLR-1740:
---

Now that we are on Lucene 3.1, it seems like it would be useful to add these 
new capabilities to the factory?
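For example, with the LUCENE-2218 options exposed, a configuration might look like the following (a sketch: the attribute names follow the ShingleFilter setters, and the committed factory may differ):

{noformat}
<filter class="solr.ShingleFilterFactory"
        minShingleSize="2" maxShingleSize="3"
        outputUnigrams="true" tokenSeparator="_"/>
{noformat}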

 ShingleFilterFactory improvements
 -

 Key: SOLR-1740
 URL: https://issues.apache.org/jira/browse/SOLR-1740
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 1.5
Reporter: Steven Rowe
Assignee: Robert Muir
Priority: Minor
 Attachments: SOLR-1740.patch


 ShingleFilterFactory should allow specification of minimum shingle size (in 
 addition to maximum shingle size), as well as the separator to use between 
 tokens.  These are implemented at LUCENE-2218.  The attached patch allows 
 ShingleFilterFactory to accept configuration of these items, and includes 
 tests against the new functionality in TestShingleFilterFactory.  
 Solr will have to upgrade to lucene-analyzers-3.1-dev.jar before the attached 
 patch will apply.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1312) BufferedTokenStream should use new Lucene 2.9 TokenStream API

2010-04-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852687#action_12852687
 ] 

Robert Muir commented on SOLR-1312:
---

Hello, I recommend we cancel this issue.

No Solr tokenstreams extend this BufferedTokenStream API anymore, as it is 
bound to Token and does not support reuse.
Currently this class is marked deprecated in trunk, with a backwards 
compatibility layer.

If we think that an API like this is useful, we should make a new 
BufferedTokenStream-like API that uses AttributeSource instead of Token, but 
this API would not support reuse and would not be very performant, as it would 
have to use cloneAttributes() and copyTo() instead of captureState() and 
restoreState().


 BufferedTokenStream should use new Lucene 2.9 TokenStream API
 -

 Key: SOLR-1312
 URL: https://issues.apache.org/jira/browse/SOLR-1312
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 1.4
Reporter: Tom Burton-West
Priority: Minor

 Since Solr 1.4 will be using Lucene 2.9, the Solr TokenFilters should 
 probably be updated  to use the Lucene 2.9 TokenStream API.   This issue is 
 to put BufferedTokenStream on the list of Filters that need updating. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1740) ShingleFilterFactory improvements

2010-04-01 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1740:
--

Attachment: SOLR-1740.patch

Steven's patch, synced to trunk.

I plan to commit shortly; thanks for the configuration tests, Steven.

 ShingleFilterFactory improvements
 -

 Key: SOLR-1740
 URL: https://issues.apache.org/jira/browse/SOLR-1740
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 1.5
Reporter: Steven Rowe
Assignee: Robert Muir
Priority: Minor
 Attachments: SOLR-1740.patch, SOLR-1740.patch


 ShingleFilterFactory should allow specification of minimum shingle size (in 
 addition to maximum shingle size), as well as the separator to use between 
 tokens.  These are implemented at LUCENE-2218.  The attached patch allows 
 ShingleFilterFactory to accept configuration of these items, and includes 
 tests against the new functionality in TestShingleFilterFactory.  
 Solr will have to upgrade to lucene-analyzers-3.1-dev.jar before the attached 
 patch will apply.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1740) ShingleFilterFactory improvements

2010-04-01 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1740:
--

Affects Version/s: (was: 1.5)
   3.1
Fix Version/s: 3.1

 ShingleFilterFactory improvements
 -

 Key: SOLR-1740
 URL: https://issues.apache.org/jira/browse/SOLR-1740
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Steven Rowe
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: SOLR-1740.patch, SOLR-1740.patch


 ShingleFilterFactory should allow specification of minimum shingle size (in 
 addition to maximum shingle size), as well as the separator to use between 
 tokens.  These are implemented at LUCENE-2218.  The attached patch allows 
 ShingleFilterFactory to accept configuration of these items, and includes 
 tests against the new functionality in TestShingleFilterFactory.  
 Solr will have to upgrade to lucene-analyzers-3.1-dev.jar before the attached 
 patch will apply.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-1740) ShingleFilterFactory improvements

2010-04-01 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved SOLR-1740.
---

Resolution: Fixed

Committed revision 930163. Thanks Steven!

 ShingleFilterFactory improvements
 -

 Key: SOLR-1740
 URL: https://issues.apache.org/jira/browse/SOLR-1740
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Steven Rowe
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: SOLR-1740.patch, SOLR-1740.patch


 ShingleFilterFactory should allow specification of minimum shingle size (in 
 addition to maximum shingle size), as well as the separator to use between 
 tokens.  These are implemented at LUCENE-2218.  The attached patch allows 
 ShingleFilterFactory to accept configuration of these items, and includes 
 tests against the new functionality in TestShingleFilterFactory.  
 Solr will have to upgrade to lucene-analyzers-3.1-dev.jar before the attached 
 patch will apply.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1857) cleanup and sync analysis with lucene trunk

2010-03-31 Thread Robert Muir (JIRA)
cleanup and sync analysis with lucene trunk
---

 Key: SOLR-1857
 URL: https://issues.apache.org/jira/browse/SOLR-1857
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
 Fix For: 3.1


Solr works on the Lucene trunk, but uses a lot of deprecated APIs.
Additionally, two factories are missing: the Keyword and StemmerOverride filters.
The code can be improved with 3.x's generics support, removing casts, etc.
Finally, there is some code duplication with Lucene, and some cleanup (such as 
deprecating factories for stuff that's deprecated in trunk).


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1857) cleanup and sync analysis with lucene trunk

2010-03-31 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1857:
--

Attachment: SOLR-1857.patch

Attached is a regrettably large patch to sync us up and clean things up a bit.

This removes all use of deprecated Lucene APIs, except via things that are now 
deprecated in Solr itself.

All tests pass.

 cleanup and sync analysis with lucene trunk
 ---

 Key: SOLR-1857
 URL: https://issues.apache.org/jira/browse/SOLR-1857
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
 Fix For: 3.1

 Attachments: SOLR-1857.patch


 Solr works on the Lucene trunk, but uses a lot of deprecated APIs.
 Additionally, two factories are missing: the Keyword and StemmerOverride 
 filters.
 The code can be improved with 3.x's generics support, removing casts, etc.
 Finally, there is some code duplication with Lucene, and some cleanup (such as 
 deprecating factories for stuff that's deprecated in trunk).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1857) cleanup and sync analysis with lucene trunk

2010-03-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852079#action_12852079
 ] 

Robert Muir commented on SOLR-1857:
---

If no one objects, I would like to commit in a day or two. If anyone wants to 
review, that's great... I know it's large...

 cleanup and sync analysis with lucene trunk
 ---

 Key: SOLR-1857
 URL: https://issues.apache.org/jira/browse/SOLR-1857
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: SOLR-1857.patch


 Solr works on the Lucene trunk, but uses a lot of deprecated APIs.
 Additionally, two factories are missing: the Keyword and StemmerOverride 
 filters.
 The code can be improved with 3.x's generics support, removing casts, etc.
 Finally, there is some code duplication with Lucene, and some cleanup (such as 
 deprecating factories for stuff that's deprecated in trunk).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (SOLR-1857) cleanup and sync analysis with lucene trunk

2010-03-31 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned SOLR-1857:
-

Assignee: Robert Muir

 cleanup and sync analysis with lucene trunk
 ---

 Key: SOLR-1857
 URL: https://issues.apache.org/jira/browse/SOLR-1857
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: SOLR-1857.patch


 Solr works on the Lucene trunk, but uses a lot of deprecated APIs.
 Additionally, two factories are missing: the Keyword and StemmerOverride 
 filters.
 The code can be improved with 3.x's generics support, removing casts, etc.
 Finally, there is some code duplication with Lucene, and some cleanup (such as 
 deprecating factories for stuff that's deprecated in trunk).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1857) cleanup and sync analysis with lucene trunk

2010-03-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852213#action_12852213
 ] 

Robert Muir commented on SOLR-1857:
---

bq. I just did a 5 min review, not line-by-line, but seems fine in general. 

Thanks for the review, Yonik; I'll move forward then and commit soon... 

I'll open an issue next for the default schema speedups... looking forward to 
this :)


 cleanup and sync analysis with lucene trunk
 ---

 Key: SOLR-1857
 URL: https://issues.apache.org/jira/browse/SOLR-1857
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: SOLR-1857.patch


 Solr works on the Lucene trunk, but uses a lot of deprecated APIs.
 Additionally, two factories are missing: the Keyword and StemmerOverride 
 filters.
 The code can be improved with 3.x's generics support, removing casts, etc.
 Finally, there is some code duplication with Lucene, and some cleanup (such as 
 deprecating factories for stuff that's deprecated in trunk).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries

2010-03-31 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned SOLR-1852:
-

Assignee: Robert Muir

 enablePositionIncrements=true can cause searches to fail when they are 
 parsed as phrase queries
 -

 Key: SOLR-1852
 URL: https://issues.apache.org/jira/browse/SOLR-1852
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Peter Wolanin
Assignee: Robert Muir
 Attachments: SOLR-1852.patch, SOLR-1852_testcase.patch


 Symptom: searching for a string like a domain name containing a '.', the Solr 
 1.4 analyzer tells me that I will get a match, but when I enter the search 
 either in the client or directly in Solr, the search fails. 
 test string:  Identi.ca
 queries that fail:  IdentiCa, Identi.ca, Identi-ca
 query that matches: Identi ca
 schema in use is:
 http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34content-type=text%2Fplainview=copathrev=DRUPAL-6--1
 Screen shots:
 analysis:  http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png
 dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png
 dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png
 standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png
 Whether or not the bug appears is determined by the surrounding text:
 "would be great to have support for Identi.ca on the follow block"
 fails to match Identi.ca, but putting the content on its own or in another 
 sentence:
 "Support Identi.ca"
 the search matches.  Testing suggests the word "for" is the problem, and it 
 looks like the bug occurs when a stop word precedes a word that is split up 
 using the word delimiter filter.
 Setting enablePositionIncrements=false in the stop filter and reindexing 
 causes the searches to match.
 According to Mark Miller in #solr, this bug appears to be fixed already in 
 Solr trunk, either due to the upgraded lucene or changes to the 
 WordDelimiterFactory

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries

2010-03-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852216#action_12852216
 ] 

Robert Muir commented on SOLR-1852:
---

I'm afraid of WDF, but I don't think I am the only one, and I think it would be 
good to fix this bug.

If no one objects, I'd like to commit these patches (testcase and backport the 
trunk filter) to the 1.5 branch in a few days.
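For context, the configuration that exhibits the bug pairs a stop filter with position increments enabled and the word delimiter filter; roughly (a sketch, not the exact chain from the linked schema):

{noformat}
<filter class="solr.StopFilterFactory" words="stopwords.txt"
        ignoreCase="true" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1"/>
{noformat}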



 enablePositionIncrements=true can cause searches to fail when they are 
 parsed as phrase queries
 -

 Key: SOLR-1852
 URL: https://issues.apache.org/jira/browse/SOLR-1852
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Peter Wolanin
Assignee: Robert Muir
 Attachments: SOLR-1852.patch, SOLR-1852_testcase.patch


 Symptom: searching for a string like a domain name containing a '.', the Solr 
 1.4 analyzer tells me that I will get a match, but when I enter the search 
 either in the client or directly in Solr, the search fails. 
 test string:  Identi.ca
 queries that fail:  IdentiCa, Identi.ca, Identi-ca
 query that matches: Identi ca
 schema in use is:
 http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34content-type=text%2Fplainview=copathrev=DRUPAL-6--1
 Screen shots:
 analysis:  http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png
 dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png
 dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png
 standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png
 Whether or not the bug appears is determined by the surrounding text:
 "would be great to have support for Identi.ca on the follow block"
 fails to match Identi.ca, but putting the content on its own or in another 
 sentence:
 "Support Identi.ca"
 the search matches.  Testing suggests the word "for" is the problem, and it 
 looks like the bug occurs when a stop word precedes a word that is split up 
 using the word delimiter filter.
 Setting enablePositionIncrements=false in the stop filter and reindexing 
 causes the searches to match.
 According to Mark Miller in #solr, this bug appears to be fixed already in 
 Solr trunk, either due to the upgraded lucene or changes to the 
 WordDelimiterFactory

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries

2010-03-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852234#action_12852234
 ] 

Robert Muir commented on SOLR-1852:
---

Peter it is... but admittedly it has not been in trunk for very long, and WDF 
is pretty complex.

It's a bit scary to backport a rewrite of it for this reason, but at the same 
time, we've got this bug 
and the other config bugs found in SOLR-1706, so I think it's the right thing to 
do... 


 enablePositionIncrements=true can cause searches to fail when they are 
 parsed as phrase queries
 -

 Key: SOLR-1852
 URL: https://issues.apache.org/jira/browse/SOLR-1852
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Peter Wolanin
Assignee: Robert Muir
 Attachments: SOLR-1852.patch, SOLR-1852_testcase.patch


 Symptom: searching for a string like a domain name containing a '.', the Solr 
 1.4 analyzer tells me that I will get a match, but when I enter the search 
 either in the client or directly in Solr, the search fails. 
 test string:  Identi.ca
 queries that fail:  IdentiCa, Identi.ca, Identi-ca
 query that matches: Identi ca
 schema in use is:
 http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34content-type=text%2Fplainview=copathrev=DRUPAL-6--1
 Screen shots:
 analysis:  http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png
 dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png
 dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png
 standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png
 Whether or not the bug appears is determined by the surrounding text:
 "would be great to have support for Identi.ca on the follow block"
 fails to match Identi.ca, but putting the content on its own or in another 
 sentence:
 "Support Identi.ca"
 the search matches.  Testing suggests the word "for" is the problem, and it 
 looks like the bug occurs when a stop word precedes a word that is split up 
 using the word delimiter filter.
 Setting enablePositionIncrements=false in the stop filter and reindexing 
 causes the searches to match.
 According to Mark Miller in #solr, this bug appears to be fixed already in 
 Solr trunk, either due to the upgraded lucene or changes to the 
 WordDelimiterFactory

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1859) speed up indexing for example schema

2010-03-31 Thread Robert Muir (JIRA)
speed up indexing for example schema


 Key: SOLR-1859
 URL: https://issues.apache.org/jira/browse/SOLR-1859
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1


The example schema should use the lucene core PorterStemmer (coded in Java by 
Martin Porter)
 instead of the Snowball one that is auto-generated code.

Although we have sped up the Snowball stemmer, it's still pretty slow and the 
example should be fast.

Below is the output of ant test -Dtestcase=TestIndexingPerformance 
-Dargs=-server -Diter=10
These results are consistent with large document indexing times that I have 
seen on large English collections with Lucene: we double indexing speed.

{noformat}
solr1.5branch:
iter=10 time=5841 throughput=17120
iter=10 time=5839 throughput=17126
iter=10 time=6017 throughput=16619

trunk (unpatched):
iter=10 time=4132 throughput=24201
iter=10 time=4142 throughput=24142
iter=10 time=4151 throughput=24090

trunk (patched)
iter=10 time=2998 throughput=33355
iter=10 time=3021 throughput=33101
iter=10 time=3006 throughput=33266
{noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1859) speed up indexing for example schema

2010-03-31 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1859:
--

Attachment: SOLR-1859.patch

Attached is a patch. I fixed every instance for general types like "text" 
in every schema file I could find, including test ones, and commented-out 
instances, too. All tests pass.


 speed up indexing for example schema
 

 Key: SOLR-1859
 URL: https://issues.apache.org/jira/browse/SOLR-1859
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: SOLR-1859.patch


 The example schema should use the lucene core PorterStemmer (coded in Java by 
 Martin Porter)
  instead of the Snowball one that is auto-generated code.
 Although we have sped up the Snowball stemmer, it's still pretty slow and the 
 example should be fast.
 Below is the output of ant test -Dtestcase=TestIndexingPerformance 
 -Dargs=-server -Diter=10
 These results are consistent with large document indexing times that I have 
 seen on large English collections with Lucene: we double indexing speed.
 {noformat}
 solr1.5branch:
 iter=10 time=5841 throughput=17120
 iter=10 time=5839 throughput=17126
 iter=10 time=6017 throughput=16619
 trunk (unpatched):
 iter=10 time=4132 throughput=24201
 iter=10 time=4142 throughput=24142
 iter=10 time=4151 throughput=24090
 trunk (patched)
 iter=10 time=2998 throughput=33355
 iter=10 time=3021 throughput=33101
 iter=10 time=3006 throughput=33266
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



protwords.txt support in stemmers

2010-03-30 Thread Robert Muir
Hello Solr devs,

One thing we did recently in Lucene that I would like to expose in Solr is
support for protected words in all stemmers.

The way this works is that a TokenStream attribute, 'KeywordAttribute', is
set, and all the stem filters know to ignore tokens with this boolean value
set.

We also added two neat token filters that make this easy to use:
* KeywordMarkerFilter: given a set of input words, marks them as keywords
with this attribute so any later stemmer ignores them.
* StemmerOverrideFilter: given a map of input words to stems, stems them
with the dictionary and marks them as keywords so any later stemmer ignores
them.

We have a few options:
* we could treat this stuff as an implementation detail and add protwords.txt
support to all stemming factories; we could just wrap the filter with a
KeywordMarkerFilter internally.
* we could deprecate the explicit protwords.txt in the few factories that
support it, and instead create a factory for KeywordMarkerFilter.
* we could do something else, e.g. both.

So, to illustrate, by adding a factory for the KeywordMarkerFilter, a user
could do:

<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.SomeStemmer"/>

and get the same effect, instead of having to add support for protwords.txt
to every single stem factory.

I don't really have a personal preference as to how we do it, but it would
be cool to have a plan so we can add these factories and clean a few things
up.

In any event, I think we should add a factory for the StemmerOverrideFilter,
so someone can have a text file with exceptions; the Dutch handling for
"fiets" comes to mind.
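To make the proposal concrete, factories for both filters could be wired into an analyzer chain like this (a sketch: the StemmerOverrideFilterFactory name, its dictionary attribute, and the stemdict.txt file are part of the proposal, not existing Solr configuration):

{noformat}
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- mark protected words as keywords so later stemmers skip them -->
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <!-- apply explicit word->stem overrides and mark them as keywords -->
  <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
</analyzer>
{noformat}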

Thanks

-- 
Robert Muir
rcm...@gmail.com


Re: protwords.txt support in stemmers

2010-03-30 Thread Robert Muir
On Tue, Mar 30, 2010 at 8:33 AM, Yonik Seeley yo...@lucidimagination.comwrote:


 It would also be nice to make the token categories generated by
 tokenizers into tags (like StandardTokenizer's ACRONYM, etc).  A
 tokenizer that detected many of the properties could significantly
 speed up analysis because tokens would not have to be re-analyzed to
 see if they contain mixed case, numbers, hyphens, etc (i.e. the fast
 path for WDF would be checking a bit per token).


I like this idea, but it does seem a little bit dangerous. E.g., the
tokenizer could set one of these values, but if some token filter down the
stream doesn't properly use it, you could introduce bugs (by assuming a word
has no numbers when in fact it now does, due to, say, a PatternReplaceFilter).

So I think we would simply end up adding a lot of these redundant checks
back; e.g., you would have to re-analyze the term after any regex replacement
from PatternReplaceFilter to properly set these flags... and it might
introduce a lot of subtle bugs.

-- 
Robert Muir
rcm...@gmail.com


Re: protwords.txt support in stemmers

2010-03-30 Thread Robert Muir
On Tue, Mar 30, 2010 at 8:33 AM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir rcm...@gmail.com wrote:
  We have two choices:
  * we could treat this stuff as impl details, and add protwords.txt
 support
  to all stemming factories. we could just wrap the filter with a
  keywordmarkerfilter internally.
  * we could deprecate the explicit protwords.txt in the few factories that
  support it, and instead create a factory for KeywordMarkerFilter.
  * we could do something else, e.g. both.
 
  So, to illustrate, by adding a factory for the KeywordMarkerFilter, a
 user
  could do:
 
  <filter class="solr.KeywordMarkerFilterFactory"
 protected="protwords.txt"/>
  <filter class="solr.SomeStemmer"/>
 
  and get the same effect, instead of having to add support for
 protwords.txt
  to every single stem factory.

 Yep, this decomposition seems more powerful.

 Sort of related: for a long time I've had the idea of allowing the
 expression of more complex filter chains that can conditionally
 execute some parts based on tags set by other parts.

 This is straightforward to just hand-code in Java of course, but
 trickier to do well in a declarative setting:

  <filter class="solr.Tagger" tag="protect" words="protwords.txt"/>
  <filter class="solr.SomeStemmer" skipTags="protect"/>

 The idea was to also make this fast by allocating a bit per tag
 (assuming we somehow knew all of the possible ones in a particular
 filter chain) and using a bitfield (long) to set and test.  I was
 planning on using Token.flags before the new analysis attribute stuff
 came into being.

 It would also be nice to make the token categories generated by
 tokenizers into tags (like StandardTokenizer's ACRONYM, etc).  A
 tokenizer that detected many of the properties could significantly
 speed up analysis because tokens would not have to be re-analyzed to
 see if they contain mixed case, numbers, hyphens, etc (i.e. the fast
 path for WDF would be checking a bit per token).

 Anyway, probably something for another day, but I wanted to throw it out
 there.

 -Yonik
 http://www.lucidimagination.com

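Yonik's tag-and-skip idea above could be prototyped along these lines. A hedged sketch in plain Python, not real Solr factories; the tag bits and filter names are invented for illustration:

```python
# Each known tag in the chain gets one bit, tokens carry a flags field
# (a long/bitfield), and a filter configured with skipTags just tests
# the mask instead of re-inspecting characters.

PROTECT = 1 << 0   # hypothetical tag bits; a real implementation would
ACRONYM = 1 << 1   # assign these while parsing the declarative config

class Token:
    def __init__(self, text, flags=0):
        self.text = text
        self.flags = flags

def tagger(tokens, words, tag_bit):
    # Analogue of a Tagger filter: set a bit on matching tokens.
    for tok in tokens:
        if tok.text in words:
            tok.flags |= tag_bit
    return tokens

def toy_stemmer(tokens, skip_mask=0):
    # A filter with skipTags="protect" needs only a single AND per token.
    for tok in tokens:
        if not (tok.flags & skip_mask) and tok.text.endswith("s"):
            tok.text = tok.text[:-1]
    return tokens

toks = tagger([Token("cats"), Token("solrs")], {"solrs"}, PROTECT)
print([t.text for t in toy_stemmer(toks, skip_mask=PROTECT)])
# ['cat', 'solrs']
```

The bit-per-tag test is what makes this fast: once the tagger has run, every downstream filter decides whether to touch a token with one bitwise AND.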

Sorta unrelated too, but on the same topic of performance, I'd really like
to improve the indexing speed with the example schema, and that's my hidden
motivation here.

I think we've already significantly improved WDF and SnowballPorter
performance in trunk, but if we add this support we could at least consider
switching to the much much faster PorterStemmer in the Lucene core for the
example schema, as it would then support protected words via this mechanism.
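For illustration, the proposed decomposition (a generic marker filter that flags protected words, plus stemmers that leave flagged tokens alone) might look like this. Plain-Python sketch, not the actual KeywordMarkerFilter API; the toy stemmer and protwords contents are invented:

```python
# One marker filter owns protected-word support; any stemmer that checks
# the "keyword" attribute skips marked tokens, so no stemmer needs its own
# protwords.txt handling.

class Token:
    def __init__(self, text):
        self.text = text
        self.keyword = False  # the attribute a keyword-aware stemmer checks

def keyword_marker_filter(tokens, protected_words):
    # Analogue of KeywordMarkerFilter: mark tokens listed in protwords.txt.
    for tok in tokens:
        if tok.text in protected_words:
            tok.keyword = True
    return tokens

def naive_s_stemmer(tokens):
    # A toy stemmer; the real point is the keyword check.
    for tok in tokens:
        if not tok.keyword and tok.text.endswith("s"):
            tok.text = tok.text[:-1]
    return tokens

protwords = {"lucenes"}  # hypothetical protwords.txt contents
tokens = [Token("cats"), Token("lucenes")]
out = naive_s_stemmer(keyword_marker_filter(tokens, protwords))
print([t.text for t in out])  # ['cat', 'lucenes']
```

Because protected-word support lives in one place, swapping in a different (e.g. faster) stemmer keeps protwords.txt working unchanged.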

Do you have a preferred way to benchmark type "text" for example? Ideally in
the future the lucene benchmark package could support benchmarking Solr
schema definitions... but we aren't there yet!

-- 
Robert Muir
rcm...@gmail.com


Re: protwords.txt support in stemmers

2010-03-30 Thread Robert Muir
On Tue, Mar 30, 2010 at 10:32 AM, Yonik Seeley
yo...@lucidimagination.comwrote:


 Unfortunately not... it's normally something ad hoc like uploading a
 big CSV file, etc.

 There's also the very simplistic TestIndexingPerformance.
 ant test -Dtestcase=TestIndexingPerformance -Dargs=-server
 -Diter=10; grep throughput
 build/test-results/*TestIndexingPerformance*


Cool, as a quick stab at this, I ran this 3 times on solr 1.5, solr trunk,
and solr trunk with the proposed mod:
The results are consistent with what I have seen indexing large docs with
just lucene, too.

solr1.5branch:
iter=10 time=5841 throughput=17120
iter=10 time=5839 throughput=17126
iter=10 time=6017 throughput=16619

trunk:
iter=10 time=4132 throughput=24201
iter=10 time=4142 throughput=24142
iter=10 time=4151 throughput=24090

trunk: swap Snowball Porter with Core Lucene Porter
iter=10 time=2978 throughput=33579
iter=10 time=2973 throughput=33636
iter=10 time=2925 throughput=34188

-- 
Robert Muir
rcm...@gmail.com


[jira] Updated: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries

2010-03-28 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1852:
--

Attachment: SOLR-1852_testcase.patch

attached is a testcase demonstrating the bug.

The problem is that if you have, for example, "the lucene.solr", where "the" is 
a stopword, the Solr 1.4 WordDelimiterFilter bumps the position increment of *both* 
"lucene" and "solr" tokens:

* lucene (posInc=2)
* solr (posInc=2)
* lucenesolr (posInc=0)

Instead it should look like:

* lucene (posInc=2)
* solr (posInc=1)
* lucenesolr (posInc=0)
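The corrected increments can be reproduced with a small model of a position-increment-preserving stop filter. Plain-Python sketch, not the Solr implementation; tokens are (text, posInc) pairs:

```python
# Removing a stopword should transfer its increment to the *next* token
# only; sibling tokens from a split (like "solr" after "lucene", and the
# catenated "lucenesolr" at posInc=0) must keep their own increments.

def stop_filter(tokens, stopwords):
    # tokens: list of (text, posInc) pairs.
    out, pending = [], 0
    for text, inc in tokens:
        if text in stopwords:
            pending += inc   # remember the hole left by the stopword
        else:
            out.append((text, inc + pending))
            pending = 0      # only the first following token absorbs it
    return out

# "the lucene.solr" after word-delimiter splitting, before stop filtering:
stream = [("the", 1), ("lucene", 1), ("solr", 1), ("lucenesolr", 0)]
print(stop_filter(stream, {"the"}))
# [('lucene', 2), ('solr', 1), ('lucenesolr', 0)]
```

Only "lucene" gets posInc=2; bumping "solr" as well (as Solr 1.4's filter did) shifts every later position and breaks phrase queries.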

In my opinion the behavior of trunk is correct, and this is a bug. 
But I don't know how to fix just Solr 1.4's WDF in a better way than dropping 
in the entire rewritten WDF...


 enablePositionIncrements=true can cause searches to fail when they are 
 parsed as phrase queries
 -

 Key: SOLR-1852
 URL: https://issues.apache.org/jira/browse/SOLR-1852
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Peter Wolanin
 Attachments: SOLR-1852.patch, SOLR-1852_testcase.patch


 Symptom: searching for a string like a domain name containing a '.', the Solr 
 1.4 analyzer tells me that I will get a match, but when I enter the search 
 either in the client or directly in Solr, the search fails. 
 test string:  Identi.ca
 queries that fail:  IdentiCa, Identi.ca, Identi-ca
 query that matches: Identi ca
 schema in use is:
 http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34content-type=text%2Fplainview=copathrev=DRUPAL-6--1
 Screen shots:
 analysis:  http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png
 dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png
 dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png
 standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png
 Whether or not the bug appears is determined by the surrounding text:
 "would be great to have support for Identi.ca on the follow block"
 fails to match Identi.ca, but putting the content on its own or in another 
 sentence:
 "Support Identi.ca"
 the search matches.  Testing suggests the word "for" is the problem, and it 
 looks like the bug occurs when a stop word precedes a word that is split up 
 using the word delimiter filter.
 Setting enablePositionIncrements=false in the stop filter and reindexing 
 causes the searches to match.
 According to Mark Miller in #solr, this bug appears to be fixed already in 
 Solr trunk, either due to the upgraded lucene or changes to the 
 WordDelimiterFactory

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

2010-03-27 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved SOLR-1710.
---

   Resolution: Fixed
Fix Version/s: 3.1
 Assignee: Mark Miller

This was resolved in revision 922957.

 convert worddelimiterfilter to new tokenstream API
 --

 Key: SOLR-1710
 URL: https://issues.apache.org/jira/browse/SOLR-1710
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Robert Muir
Assignee: Mark Miller
 Fix For: 3.1

 Attachments: SOLR-1710-readable.patch, SOLR-1710-readable.patch, 
 SOLR-1710.patch, SOLR-1710.patch


 This one was a doozy, attached is a patch to convert it to the new 
 tokenstream API.
 Some of the logic was split into WordDelimiterIterator (exposes a 
 BreakIterator-like api for iterating subwords)
 the filter is much more efficient now, no cloning.
 before applying the patch, copy the existing WordDelimiterFilter to 
 OriginalWordDelimiterFilter
 the patch includes a testcase (TestWordDelimiterBWComp) which generates 
 random strings from various subword combinations.
 For each random string, it compares output against the existing 
 WordDelimiterFilter for all 512 combinations of boolean parameters.
 NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
 combinations. The bugs discovered in SOLR-1706 are fixed here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-1657) convert the rest of solr to use the new tokenstream API

2010-03-27 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved SOLR-1657.
---

   Resolution: Fixed
Fix Version/s: 3.1
 Assignee: Mark Miller

This was resolved in revision 922957.

 convert the rest of solr to use the new tokenstream API
 ---

 Key: SOLR-1657
 URL: https://issues.apache.org/jira/browse/SOLR-1657
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
Assignee: Mark Miller
 Fix For: 3.1

 Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, 
 SOLR-1657.patch, SOLR-1657_part2.patch, 
 SOLR-1657_synonyms_ugly_slightly_less_slow.patch, 
 SOLR-1657_synonyms_ugly_slow.patch


 org.apache.solr.analysis:
 -BufferedTokenStream-
  - -CommonGramsFilter-
  - -CommonGramsQueryFilter-
  - -RemoveDuplicatesTokenFilter-
 -CapitalizationFilterFactory-
 -HyphenatedWordsFilter-
 -LengthFilter (deprecated, remove)-
 SynonymFilter
 SynonymFilterFactory
 -WordDelimiterFilter-
 -org.apache.solr.handler:-
 -AnalysisRequestHandler-
 -AnalysisRequestHandlerBase-
 -org.apache.solr.handler.component:-
 -QueryElevationComponent-
 -SpellCheckComponent-
 -org.apache.solr.highlight:-
 -DefaultSolrHighlighter-
 -org.apache.solr.spelling:-
 -SpellingQueryConverter-

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-1706) wrong tokens output from WordDelimiterFilter depending upon options

2010-03-27 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved SOLR-1706.
---

   Resolution: Fixed
Fix Version/s: 3.1
 Assignee: Mark Miller

This was resolved in revision 922957.

 wrong tokens output from WordDelimiterFilter depending upon options
 ---

 Key: SOLR-1706
 URL: https://issues.apache.org/jira/browse/SOLR-1706
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis
Affects Versions: 1.4
Reporter: Robert Muir
Assignee: Mark Miller
 Fix For: 3.1


 below you can see that when I have requested to only output numeric 
 concatenations (not words), some words are still sometimes output, ignoring 
 the options I have provided, and even then, in a very inconsistent way.
 {code}
    assertWdf("Super-Duper-XL500-42-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
  new String[] { "42", "AutoCoder" },
  new int[] { 18, 21 },
  new int[] { 20, 30 },
  new int[] { 1, 1 });
    assertWdf("Super-Duper-XL500-42-AutoCoder's-56", 0,0,0,1,0,0,0,0,1, null,
  new String[] { "42", "AutoCoder", "56" },
  new int[] { 18, 21, 33 },
  new int[] { 20, 30, 35 },
  new int[] { 1, 1, 1 });
    assertWdf("Super-Duper-XL500-AB-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
  new String[] { },
  new int[] { },
  new int[] { },
  new int[] { });
    assertWdf("Super-Duper-XL500-42-AutoCoder's-BC", 0,0,0,1,0,0,0,0,1, null,
  new String[] { "42" },
  new int[] { 18 },
  new int[] { 20 },
  new int[] { 1 });
 {code}
 where assertWdf is 
 {code}
   void assertWdf(String text, int generateWordParts, int generateNumberParts,
   int catenateWords, int catenateNumbers, int catenateAll,
   int splitOnCaseChange, int preserveOriginal, int splitOnNumerics,
   int stemEnglishPossessive, CharArraySet protWords, String expected[],
   int startOffsets[], int endOffsets[], String types[], int posIncs[])
   throws IOException {
 TokenStream ts = new WhitespaceTokenizer(new StringReader(text));
 WordDelimiterFilter wdf = new WordDelimiterFilter(ts, generateWordParts,
 generateNumberParts, catenateWords, catenateNumbers, catenateAll,
 splitOnCaseChange, preserveOriginal, splitOnNumerics,
 stemEnglishPossessive, protWords);
 assertTokenStreamContents(wdf, expected, startOffsets, endOffsets, types,
 posIncs);
   }
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-1820) Remove custom greek/russian charsets encoding

2010-03-27 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved SOLR-1820.
---

   Resolution: Fixed
Fix Version/s: 3.1
 Assignee: Robert Muir

This was resolved in revision 922964.

 Remove custom greek/russian charsets encoding
 -

 Key: SOLR-1820
 URL: https://issues.apache.org/jira/browse/SOLR-1820
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: SOLR-1820.patch


 In Solr 1.4, we deprecated support for 'custom encodings embedded inside 
 unicode'.
 This is where the analyzer in lucene itself did encoding conversions; it's 
 better to just let 
 analyzers be analyzers, and leave encoding conversion to Java.
 In order to move to Lucene 3.x, we need to remove this deprecated support, 
 and instead
 issue an error in the factories if you try to do this (instead of a warning).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries

2010-03-27 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850612#action_12850612
 ] 

Robert Muir commented on SOLR-1852:
---

bq. The changes in the patch originate at SOLR-1706 and SOLR-1657, however I 
don't think it's actually the same bug as SOLR-1706 intended to fix since in 
the admin analyzer interface the generated tokens look correct. 

Yeah, I don't like the situation at all, as it's not obvious to me at a glance 
how the trunk impl fixes your problem, but at the same time how this changed 
behavior slipped past the random tests on SOLR-1710.


 enablePositionIncrements=true can cause searches to fail when they are 
 parsed as phrase queries
 -

 Key: SOLR-1852
 URL: https://issues.apache.org/jira/browse/SOLR-1852
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Peter Wolanin
 Attachments: SOLR-1852.patch


 Symptom: searching for a string like a domain name containing a '.', the Solr 
 1.4 analyzer tells me that I will get a match, but when I enter the search 
 either in the client or directly in Solr, the search fails. 
 test string:  Identi.ca
 queries that fail:  IdentiCa, Identi.ca, Identi-ca
 query that matches: Identi ca
 schema in use is:
 http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34content-type=text%2Fplainview=copathrev=DRUPAL-6--1
 Screen shots:
 analysis:  http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png
 dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png
 dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png
 standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png
 Whether or not the bug appears is determined by the surrounding text:
 "would be great to have support for Identi.ca on the follow block"
 fails to match Identi.ca, but putting the content on its own or in another 
 sentence:
 "Support Identi.ca"
 the search matches.  Testing suggests the word "for" is the problem, and it 
 looks like the bug occurs when a stop word precedes a word that is split up 
 using the word delimiter filter.
 Setting enablePositionIncrements=false in the stop filter and reindexing 
 causes the searches to match.
 According to Mark Miller in #solr, this bug appears to be fixed already in 
 Solr trunk, either due to the upgraded lucene or changes to the 
 WordDelimiterFactory

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries

2010-03-27 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850613#action_12850613
 ] 

Robert Muir commented on SOLR-1852:
---

ok, so your bug relates somehow to how the accumulated position increment gap 
is handled.

This is how your stopword fits into the situation; somehow the new code is 
handling it better for your case, but perhaps it's wrong.

there are quite a few tests in TestWordDelimiter, which it passes, but I'll 
spend some time tonight verifying its correctness before we declare success...

 enablePositionIncrements=true can cause searches to fail when they are 
 parsed as phrase queries
 -

 Key: SOLR-1852
 URL: https://issues.apache.org/jira/browse/SOLR-1852
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Peter Wolanin
 Attachments: SOLR-1852.patch


 Symptom: searching for a string like a domain name containing a '.', the Solr 
 1.4 analyzer tells me that I will get a match, but when I enter the search 
 either in the client or directly in Solr, the search fails. 
 test string:  Identi.ca
 queries that fail:  IdentiCa, Identi.ca, Identi-ca
 query that matches: Identi ca
 schema in use is:
 http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34content-type=text%2Fplainview=copathrev=DRUPAL-6--1
 Screen shots:
 analysis:  http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png
 dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png
 dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png
 standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png
 Whether or not the bug appears is determined by the surrounding text:
 "would be great to have support for Identi.ca on the follow block"
 fails to match Identi.ca, but putting the content on its own or in another 
 sentence:
 "Support Identi.ca"
 the search matches.  Testing suggests the word "for" is the problem, and it 
 looks like the bug occurs when a stop word precedes a word that is split up 
 using the word delimiter filter.
 Setting enablePositionIncrements=false in the stop filter and reindexing 
 causes the searches to match.
 According to Mark Miller in #solr, this bug appears to be fixed already in 
 Solr trunk, either due to the upgraded lucene or changes to the 
 WordDelimiterFactory

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: svn commit: r928069 - in /lucene/dev/trunk: lucene/ lucene/backwards/src/test/org/apache/lucene/util/ lucene/contrib/benchmark/src/test/org/apache/lucene/benchmark/ lucene/contrib/benchmark/src/

2010-03-26 Thread Robert Muir
();
 +          } catch(LockReleaseFailedException e) {
 +            // well lets pretend its released anyway
 +          }
 +        }
       } catch (IOException e) {
         throw new RuntimeException("unable to write results", e);
       } finally {
 @@ -227,3 +254,4 @@ public class SolrJUnitResultFormatter im
     sb.append(StringUtils.LINE_SEP);
   }
  }
 +

 Modified: lucene/dev/trunk/solr/build.xml
 URL: 
 http://svn.apache.org/viewvc/lucene/dev/trunk/solr/build.xml?rev=928069r1=928068r2=928069view=diff
 ==
 --- lucene/dev/trunk/solr/build.xml (original)
 +++ lucene/dev/trunk/solr/build.xml Fri Mar 26 21:55:57 2010
 @@ -349,6 +349,7 @@
     <pathelement location="${dest}/tests"/>
     <!-- include the solrj classpath and jetty files included in example -->
     <path refid="compile.classpath.solrj" />
 +    <pathelement location="${common-solr.dir}/../lucene/build/classes/test" 
 />  <!-- include some lucene test code -->
     <pathelement path="${java.class.path}"/>
   </path>


 Modified: lucene/dev/trunk/solr/common-build.xml
 URL: 
 http://svn.apache.org/viewvc/lucene/dev/trunk/solr/common-build.xml?rev=928069r1=928068r2=928069view=diff
 ==
 --- lucene/dev/trunk/solr/common-build.xml (original)
 +++ lucene/dev/trunk/solr/common-build.xml Fri Mar 26 21:55:57 2010
 @@ -103,7 +103,7 @@
   <property name="junit.output.dir" 
 location="${common-solr.dir}/${dest}/test-results"/>
   <property name="junit.reports" 
 location="${common-solr.dir}/${dest}/test-results/reports"/>
   <property name="junit.formatter" value="plain"/>
 -  <property name="junit.details.formatter" 
 value="org.apache.solr.SolrJUnitResultFormatter"/>
 +  <property name="junit.details.formatter" 
 value="org.apache.lucene.util.LuceneJUnitResultFormatter"/>

   <!-- Maven properties -->
   <property name="maven.build.dir" value="${basedir}/build/maven"/>






-- 
Robert Muir
rcm...@gmail.com


build.xml and lucene test code

2010-03-25 Thread Robert Muir
I noticed that for whatever reason, solr's build.xml doesn't detect if
lucene's test code is out of date.

(I am fooling around with LUCENE-1709, where we will try to do the same
parallel test execution for Lucene as in Solr, and was moving the
special formatter to lucene when I noticed this.)

Don't have any ideas how to fix it, but just wanted to mention it so it's
not forgotten.

Worst case, if/when we resolve LUCENE-1709, you will have to run 'ant
clean' first... but I am sure there is some better ant trickery to
detect this situation, maybe just another task dependency.

-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (SOLR-1835) speed up and improve tests

2010-03-23 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848586#action_12848586
 ] 

Robert Muir commented on SOLR-1835:
---

committed revision 926470 to newtrunk.

if you have problems, please just revert and I will help debug them.
for future speedups, we should try to move ant logic to common-build.xml and 
re-use it for contribs.
this way, DIH tests etc will run in parallel, too. 


 speed up and improve tests
 --

 Key: SOLR-1835
 URL: https://issues.apache.org/jira/browse/SOLR-1835
 Project: Solr
  Issue Type: Improvement
Reporter: Yonik Seeley
 Fix For: 3.1

 Attachments: SOLR-1835.patch, SOLR-1835_parallel.patch, 
 SOLR-1835_parallel.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch


 General test improvements.
 We should use @BeforeClass where possible to avoid per test method overhead, 
 and reuse lucene test utils where possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1835) speed up and improve tests

2010-03-22 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1835:
--

Attachment: SOLR-1835_parallel.patch

attached is a patch to parallelize the tests...
improvements can be done, and contrib too (e.g. DIH)

but this drops my test time to 4:42 on the first try.

 speed up and improve tests
 --

 Key: SOLR-1835
 URL: https://issues.apache.org/jira/browse/SOLR-1835
 Project: Solr
  Issue Type: Improvement
Reporter: Yonik Seeley
 Fix For: 3.1

 Attachments: SOLR-1835.patch, SOLR-1835_parallel.patch


 General test improvements.
 We should use @BeforeClass where possible to avoid per test method overhead, 
 and reuse lucene test utils where possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1835) speed up and improve tests

2010-03-22 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1835:
--

Attachment: SOLR-1835_parallel.patch

updated patch:
* doesn't do parallel for the -Dtestcase= case, but does for all, -Dtestpackage, 
-Dtestpackageroot, etc.
* you can make the condition for whether to do parallel or not more complex; 
e.g. nightlies could go sequentially.


 speed up and improve tests
 --

 Key: SOLR-1835
 URL: https://issues.apache.org/jira/browse/SOLR-1835
 Project: Solr
  Issue Type: Improvement
Reporter: Yonik Seeley
 Fix For: 3.1

 Attachments: SOLR-1835.patch, SOLR-1835_parallel.patch, 
 SOLR-1835_parallel.patch


 General test improvements.
 We should use @BeforeClass where possible to avoid per test method overhead, 
 and reuse lucene test utils where possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1835) speed up and improve tests

2010-03-22 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1835:
--

Attachment: SOLR-1835_parallel.patch

attached is a new patch:
* the output from multiple threads is no longer interleaved
* you need to put ant.jar and ant-junit.jar in example/lib for this patch to 
work. These need to be ant 1.7.1 (lucene needs this version anyway, I think.)


 speed up and improve tests
 --

 Key: SOLR-1835
 URL: https://issues.apache.org/jira/browse/SOLR-1835
 Project: Solr
  Issue Type: Improvement
Reporter: Yonik Seeley
 Fix For: 3.1

 Attachments: SOLR-1835.patch, SOLR-1835_parallel.patch, 
 SOLR-1835_parallel.patch, SOLR-1835_parallel.patch


 General test improvements.
 We should use @BeforeClass where possible to avoid per test method overhead, 
 and reuse lucene test utils where possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1835) speed up and improve tests

2010-03-22 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1835:
--

Attachment: SOLR-1835_parallel.patch

There was a stray slash in the previous version.

This caused some people to mistakenly believe they have a faster computer than 
me.


 speed up and improve tests
 --

 Key: SOLR-1835
 URL: https://issues.apache.org/jira/browse/SOLR-1835
 Project: Solr
  Issue Type: Improvement
Reporter: Yonik Seeley
 Fix For: 3.1

 Attachments: SOLR-1835.patch, SOLR-1835_parallel.patch, 
 SOLR-1835_parallel.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch


 General test improvements.
 We should use @BeforeClass where possible to avoid per test method overhead, 
 and reuse lucene test utils where possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: rough outline of where Solr's going

2010-03-18 Thread Robert Muir
On Thu, Mar 18, 2010 at 11:33 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 On version numbering... my inclination would be to let Solr and Lucene
 use their own version numbers (don't sync them up).  I know it'd
 simplify our lives to have the same version across the board, but
 these numbers are really for our users, telling them when big changes
 were made, back compat broken, etc.  I think that trumps dev
 convenience.

Be sure to consider the deprecation removals; it's not possible for
Solr to move to Lucene's trunk without this.

Here are two examples of necessary deprecation removals in the branch
so that Solr can use Lucene's trunk:
https://issues.apache.org/jira/browse/SOLR-1820
http://www.lucidimagination.com/search/document/f07da8e4d69f5bfe/removal_of_deprecated_htmlstrip_tokenizer_factories

It seems to be the consensus that people want a major version change
number when this is done.

So this is an example where the version numbers of Solr really do
relate to Lucene, if we want them to share the same trunk.


-- 
Robert Muir
rcm...@gmail.com


Re: rough outline of where Solr's going

2010-03-18 Thread Robert Muir
On Thu, Mar 18, 2010 at 1:12 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 Ahh, OK.

 Meaning Solr will have to remove deprecated support, which means
 Solr's next released version would be a major release?  Ie 2.0?


It's more complex than this. Solr depends on some lucene contrib
modules, which apparently have no backwards-compatibility policy.

I don't think we want to have to suddenly treat all these contrib
modules like core lucene with regards to backwards compat, some of
them haven't reached that level of maturity yet.

On the other hand, exposing contrib's functionality via Solr is a
great way to get more real users and devs giving feedback and
improvements to help them mature.

But we need to work on how to handle some of this: I suppose spatial
is the worst case (don't really know), where Solr has a dependency on
a Lucene contrib specifically labelled as experimental.

-- 
Robert Muir
rcm...@gmail.com


Re: How do I contribute bug fixes

2010-03-18 Thread Robert Muir
On Thu, Mar 18, 2010 at 6:49 PM, Sanjoy Ghosh san...@yahoo.com wrote:
 Hello,

 Can I submit bug fixes?  If so, what is the procedure?

 Thanks,
 Sanjoy

Hello,

Please take a look at this link: http://wiki.apache.org/solr/HowToContribute

-- 
Robert Muir
rcm...@gmail.com


Re: lucene and solr trunk

2010-03-17 Thread Robert Muir
On Wed, Mar 17, 2010 at 9:09 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 Git, Maven, Hg, etc., all sound great for the future, but let's focus
 now on the baby step (how to re-org svn), today, so we can land the
 Solr upgrade work now being done on a branch...


I agree.

Another thing anyone can do to help if they have a spare few minutes,
is to review the technical work done in the branch and provide
feedback.
The big JIRA issue is located at
https://issues.apache.org/jira/browse/SOLR-1659 and other issues are
linked to it.

-- 
Robert Muir
rcm...@gmail.com


Re: lucene and solr trunk

2010-03-17 Thread Robert Muir
On Wed, Mar 17, 2010 at 12:40 PM, Mark Miller markrmil...@gmail.com wrote:
 Okay, so this looks good to me (a few others seemed to like it - though
 Lucene-Dev was somehow dropped earlier) - lets try this out on the branch?
 (then we can get rid of that horrible branch name ;) )

 Anyone on the current branch object to having to do a quick svn switch?


+1

-- 
Robert Muir
rcm...@gmail.com


Re: rough outline of where Solr's going

2010-03-17 Thread Robert Muir
On Wed, Mar 17, 2010 at 8:15 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 My key point being: Version numbers should communicate the
 significance of change to the *user* of the product, and the users of
 Solr are different than the users of Lucene-Java, so even if the releases
 happen in lock step, that doesn't mean the version numbers should be in
 lock step.


As you stated modules were important to think about for svn location,
then it would only make sense that they are important to think about
for release numbering, too.

So let's say we spin off a lucene-analyzers module; it should be 3.1,
too, as it's already a module to some degree, and having a
lucene-analyzers-1.0.jar would be downright misleading.

So from this perspective of modules, with solr being a module
alongside lucene, 3.1 makes a lot of sense, and it also makes sense to
try to release things together if possible so that users aren't
confused.


-- 
Robert Muir
rcm...@gmail.com


Re: lucene and solr trunk

2010-03-16 Thread Robert Muir
On Tue, Mar 16, 2010 at 3:43 AM, Simon Willnauer
simon.willna...@googlemail.com wrote:

 One more thing which I wonder about even more is that this whole
 merging happens so quickly for reasons I don't see right now. I don't
 want to keep anybody from making progress but it appears like a rush
 to me.


By the way, most of the serious changes we applied to the branch
have been sitting in JIRA for over 3 months not doing much: SOLR-1659

if you follow the linked issues, you can see all the stuff that got
put in the branch... the branch was helpful for me, as I could help
Mark with the ton of little things, like TokenStreams embedded
inside JSP files :)

As it's just a branch, if you want to go look at those patches
(especially anything I did) and provide technical feedback, that would
be great!

But I think it's a mistake to say things are rushed when the work has
been done for months.

-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845301#action_12845301
 ] 

Robert Muir commented on SOLR-1804:
---

I wonder if you guys have any insight why the results of this test may have 
changed from 16 to 15 between Lucene 3.0 and Lucene 3.1-dev: 
http://svn.apache.org/viewvc?view=revisionrevision=923048

It did not change between Lucene 2.9 and Lucene 3.0, so I'm concerned about why 
the results would change between 3.0 and 3.1-dev. 

One possible explanation would be if Carrot2 used Version.LUCENE_CURRENT 
somewhere in its code. Any ideas?

 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845451#action_12845451
 ] 

Robert Muir commented on SOLR-1804:
---

Hi Stanislaw:

Correct, I did not upgrade anything else, just lucene. 

I'm sorry it's not exactly related to this issue 
(although if we need to upgrade Carrot2 to be compatible with Lucene 3.x, then 
that's ok)

My concern is more that we did something in Lucene between 3.0 
and now that caused the results to be different... though again
this could be explained if somewhere in its code Carrot2 uses some
Lucene analysis component, but doesn't hardwire Version to LUCENE_29.

If all else fails I can try to seek out the svn rev # of Lucene that causes 
this change,
by brute force binary search :)

 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845455#action_12845455
 ] 

Robert Muir commented on SOLR-1804:
---

Grant, I am concerned about a possible BW break in Lucene trunk, that is all.
I think it's strange that the 3.0 and 3.1 jars give different results.

Can you tell me if the clusters are reasonable? Here is the output.

{noformat}
junit.framework.AssertionFailedError: number of clusters: [
{labels=[Data Mining Applications], docs=[5, 13, 25, 12, 27],clusters=[]}, 
{labels=[Databases],docs=[15, 21, 7, 17, 11],clusters=[]}, 
{labels=[Knowledge Discovery],docs=[6, 18, 15, 17, 10],clusters=[]}, 
{labels=[Statistical Data Mining],docs=[28, 24, 2, 14],clusters=[]}, 
{labels=[Data Mining Solutions],docs=[5, 22, 8],clusters=[]}, 
{labels=[Data Mining Techniques],docs=[12, 2, 14],clusters=[]}, 
{labels=[Known as Data Mining],docs=[23, 17, 19],clusters=[]}, 
{labels=[Text Mining],docs=[6, 9, 29],clusters=[]}, 
{labels=[Dedicated],docs=[10, 11],clusters=[]}, 
{labels=[Extraction of Hidden Predictive],docs=[3, 11],clusters=[]}, 
{labels=[Information from Large],docs=[3, 7],clusters=[]}, 
{labels=[Neural Networks],docs=[12, 1],clusters=[]}, 
{labels=[Open],docs=[15, 20],clusters=[]}, 
{labels=[Research],docs=[26, 8],clusters=[]}, 
{labels=[Other Topics],docs=[16],clusters=[]}
] expected:16 but was:15
{noformat}

 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845474#action_12845474
 ] 

Robert Muir commented on SOLR-1804:
---

Thanks for the confirmation the clusters are ok.

Well, this is embarrassing: it turns out it is a backwards break, 
though a documented one, and the culprit is yours truly.

This is the reason it gets different results:
{noformat}
* LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default.
  This means that terms with a position increment gap of zero do not
  affect the norms calculation by default.  (Robert Muir)
{noformat}

I'll change the test to expect 15 clusters with Lucene 3.1, thanks :)
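The change is easy to see in isolation. The sketch below is my own illustration of the effect, not the actual DefaultSimilarity code (the 1/sqrt(numTerms) formula matches DefaultSimilarity's length norm, but the sample position increments are made up): with discountOverlaps enabled, tokens at a position increment of zero (e.g. stacked synonyms) no longer count toward the field length used for the norm, so scores -- and anything downstream that depends on them, like these cluster counts -- can shift.

```java
// Illustration only: how discountOverlaps changes the length norm.
public class DiscountOverlapsDemo {
    static float lengthNorm(int[] posIncrements, boolean discountOverlaps) {
        int numTerms = 0;
        int numOverlaps = 0;
        for (int inc : posIncrements) {
            numTerms++;
            if (inc == 0) numOverlaps++; // token stacked on the previous position
        }
        if (discountOverlaps) numTerms -= numOverlaps;
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // four tokens, one emitted at the same position (a synonym, inc = 0)
        int[] incs = {1, 1, 0, 1};
        System.out.println(lengthNorm(incs, false)); // old default: all 4 tokens count
        System.out.println(lengthNorm(incs, true));  // new default: only 3 count
    }
}
```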

 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



removal of deprecated HtmlStrip*Tokenizer factories

2010-03-15 Thread Robert Muir
Hello,

Is there any concern with removing the deprecated HtmlStrip*Tokenizer factories?

These can be done with a CharFilter instead, and they have some problems
with Lucene's trunk.

If no one objects, I'd like to remove these in the branch.
Otherwise, Uwe tells me there is some way to make them work if need be.

Thanks!

-- 
Robert Muir
rcm...@gmail.com


Re: removal of deprecated HtmlStrip*Tokenizer factories

2010-03-15 Thread Robert Muir
On Mon, Mar 15, 2010 at 5:30 PM, Shalin Shekhar Mangar
shalinman...@gmail.com wrote:

 Is there a way we can fix LUCENE-2098 too?


I think this is good to fix, yet removing the deprecations is
unrelated to this slowdown.

The deprecated functionality (HtmlStrip*Tokenizer) is implemented in
terms of the slower CharFilter, so it's not any faster; getting rid of
it won't slow anyone down.

That being said I think we should still try to improve the performance
of this stuff, I agree.

-- 
Robert Muir
rcm...@gmail.com


Re: removal of deprecated HtmlStrip*Tokenizer factories

2010-03-15 Thread Robert Muir
On Mon, Mar 15, 2010 at 7:18 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 In the case of these factories: can't we eliminate the Html*Tokenizers
 themselves, but make the *factories* return the neccessary *Tokenizer
 wrapped in an HtmlStripCharFilter ?

They would not be reusable if you did this, because when you
call reset(Reader) on them, the Reader would not be wrapped.
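A toy illustration of that reuse problem (stand-in classes of my own, not the real Tokenizer/HtmlStripCharFilter APIs): if the factory wraps only the initial Reader, the wrap is lost the first time the consumer hands the tokenizer a fresh Reader via reset.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReuseDemo {
    // stand-in for a reusable Tokenizer: consumes whatever Reader it currently holds
    static class Tok {
        Reader in;
        Tok(Reader in) { this.in = in; }
        void reset(Reader r) { this.in = r; }   // reuse path: gets a *raw* Reader
        String readAll() throws IOException {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = in.read()) != -1; ) sb.append((char) c);
            return sb.toString();
        }
    }

    // stand-in for an HTML-stripping CharFilter: drops everything between '<' and '>'
    static Reader strip(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean inTag = false;
        for (int c; (c = r.read()) != -1; ) {
            if (c == '<') inTag = true;
            else if (c == '>') inTag = false;
            else if (!inTag) sb.append((char) c);
        }
        return new StringReader(sb.toString());
    }

    public static void main(String[] args) throws IOException {
        // the factory wraps the first Reader...
        Tok tok = new Tok(strip(new StringReader("<b>hello</b>")));
        System.out.println(tok.readAll());      // hello  -- markup stripped
        // ...but on reuse the consumer calls reset(Reader) with an unwrapped Reader
        tok.reset(new StringReader("<b>world</b>"));
        System.out.println(tok.readAll());      // <b>world</b>  -- markup leaks through
    }
}
```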


-- 
Robert Muir
rcm...@gmail.com


Re: removal of deprecated HtmlStrip*Tokenizer factories

2010-03-15 Thread Robert Muir
On Mon, Mar 15, 2010 at 7:25 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 Hmmm... I'm not sure i understand how any declared CharFilter/TOkenizer
 combo will be able to deal with this any better, but i'll take your word
 for it.

You can see this behavior in SolrAnalyzer's reusableTokenStream
method: it reuses the Tokenizer but wraps the readers with
charStream() [overridden by TokenizerChain to wrap the Reader with
your CharFilter chain].

  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader
reader) throws IOException {
// if (true) return tokenStream(fieldName, reader);
TokenStreamInfo tsi = (TokenStreamInfo)getPreviousTokenStream();
if (tsi != null) {
  tsi.getTokenizer().reset(charStream(reader)); // -- right here



 Kill it then, and we'll just have to start making a list in the
 Upgrading section of CHANGES.txt noting the recommended upgrad path
 for this (and many, many things to come i imagine)


Cool, I'll add some additional verbiage to the CHANGES in the branch.



-- 
Robert Muir
rcm...@gmail.com


Re: lucene and solr trunk

2010-03-15 Thread Robert Muir
On Mon, Mar 15, 2010 at 11:43 PM, Mark Miller markrmil...@gmail.com wrote:

 Solr moves to Lucene's trunk:
   /java/trunk, /java/trunk/sol

 +1. With the goal of merged dev, merged tests, this looks the best to me.
 Simple to do patches that span both, simple to setup
 Solr to use Lucene trunk rather than jars. Short paths. Simple. I like it.


+1


-- 
Robert Muir
rcm...@gmail.com


Re: lucene and solr trunk

2010-03-15 Thread Robert Muir
On Tue, Mar 16, 2010 at 12:01 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:
  4) should it be possible for people to check out Lucene-Java w/o
 checking out Solr?

 (i suspect a whole lot of people who only care about the core library are
 going to really adamantly not want to have to check out all of Solr just
 to work on the core)

This wouldn't really be merged development now, would it?
When I run 'ant test' I want the Solr tests to run, too.
If one breaks because of a change, I want to look at the source and know why.

-- 
Robert Muir
rcm...@gmail.com


Re: lucene and solr trunk

2010-03-15 Thread Robert Muir
On Tue, Mar 16, 2010 at 12:39 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 And as a committer, you should be concerned about things like this ...
 that doesn't mean every user of Lucene-Java who wants to build from source
 or apply their own local patches is going to feel the same way.


Yep, those users probably already hate our backwards tests and the
contrib tests too.


-- 
Robert Muir
rcm...@gmail.com


[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API

2010-03-14 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1657:
--

Attachment: SOLR-1657_synonyms_ugly_slow.patch

A very, very ugly, very slow, but simple and conservative conversion of 
SynonymFilter to the new TokenStream API.


 convert the rest of solr to use the new tokenstream API
 ---

 Key: SOLR-1657
 URL: https://issues.apache.org/jira/browse/SOLR-1657
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, 
 SOLR-1657.patch, SOLR-1657_part2.patch, SOLR-1657_synonyms_ugly_slow.patch


 org.apache.solr.analysis:
 -BufferedTokenStream-
  - -CommonGramsFilter-
  - -CommonGramsQueryFilter-
  - -RemoveDuplicatesTokenFilter-
 -CapitalizationFilterFactory-
 -HyphenatedWordsFilter-
 -LengthFilter (deprecated, remove)-
 SynonymFilter
 SynonymFilterFactory
 -WordDelimiterFilter-
 -org.apache.solr.handler:-
 -AnalysisRequestHandler-
 -AnalysisRequestHandlerBase-
 -org.apache.solr.handler.component:-
 -QueryElevationComponent-
 -SpellCheckComponent-
 -org.apache.solr.highlight:-
 -DefaultSolrHighlighter-
 -org.apache.solr.spelling:-
 -SpellingQueryConverter-

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API

2010-03-14 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1657:
--

Attachment: SOLR-1657_synonyms_ugly_slightly_less_slow.patch

Attached is a less slow version of the above;
it preserves the fast path from the previous code.

 convert the rest of solr to use the new tokenstream API
 ---

 Key: SOLR-1657
 URL: https://issues.apache.org/jira/browse/SOLR-1657
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, 
 SOLR-1657.patch, SOLR-1657_part2.patch, 
 SOLR-1657_synonyms_ugly_slightly_less_slow.patch, 
 SOLR-1657_synonyms_ugly_slow.patch


 org.apache.solr.analysis:
 -BufferedTokenStream-
  - -CommonGramsFilter-
  - -CommonGramsQueryFilter-
  - -RemoveDuplicatesTokenFilter-
 -CapitalizationFilterFactory-
 -HyphenatedWordsFilter-
 -LengthFilter (deprecated, remove)-
 SynonymFilter
 SynonymFilterFactory
 -WordDelimiterFilter-
 -org.apache.solr.handler:-
 -AnalysisRequestHandler-
 -AnalysisRequestHandlerBase-
 -org.apache.solr.handler.component:-
 -QueryElevationComponent-
 -SpellCheckComponent-
 -org.apache.solr.highlight:-
 -DefaultSolrHighlighter-
 -org.apache.solr.spelling:-
 -SpellingQueryConverter-

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1820) Remove custom greek/russian charsets encoding

2010-03-14 Thread Robert Muir (JIRA)
Remove custom greek/russian charsets encoding
-

 Key: SOLR-1820
 URL: https://issues.apache.org/jira/browse/SOLR-1820
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Reporter: Robert Muir
Priority: Minor


In Solr 1.4, we deprecated support for 'custom encodings embedded inside 
Unicode'.

This is where the analyzer in Lucene itself did encoding conversions; it's 
better to just let 
analyzers be analyzers and leave encoding conversion to Java.

In order to move to Lucene 3.x, we need to remove this deprecated support, and 
instead
issue an error in the factories if you try to do this (instead of a warning).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1820) Remove custom greek/russian charsets encoding

2010-03-14 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1820:
--

Attachment: SOLR-1820.patch

Attached is a patch that removes the deprecated bits.
If you try to specify the charset param, instead of a warning you get an error.


 Remove custom greek/russian charsets encoding
 -

 Key: SOLR-1820
 URL: https://issues.apache.org/jira/browse/SOLR-1820
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Reporter: Robert Muir
Priority: Minor
 Attachments: SOLR-1820.patch


 In Solr 1.4, we deprecated support for 'custom encodings embedded inside 
 Unicode'.
 This is where the analyzer in Lucene itself did encoding conversions; it's 
 better to just let 
 analyzers be analyzers and leave encoding conversion to Java.
 In order to move to Lucene 3.x, we need to remove this deprecated support, 
 and instead
 issue an error in the factories if you try to do this (instead of a warning).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



TestEvaluatorBag

2010-03-14 Thread Robert Muir
Hey guys,

I am seeing a test failure for TestEvaluatorBag...

I wonder if you guys have any ideas; I thought it might be my locale,
but I changed it and I still hit it consistently.

Thanks!

-- 
Robert Muir
rcm...@gmail.com


Re: TestEvaluatorBag

2010-03-14 Thread Robert Muir
I think this is a platform/timezone-dependent problem.

This is why switching my locale didn't work: the test started
failing because today in the US we switched to Daylight Saving Time, and
somehow the test only fails for people in those timezones.
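The failure pattern can be shown without DIH at all. This is a minimal sketch of the pitfall, with made-up dates rather than the actual TestEvaluatorBag code: computing "two days ago" by subtracting raw milliseconds disagrees with wall-clock calendar arithmetic by exactly one hour when the interval crosses the US spring-forward boundary (March 14, 2010).

```java
import java.util.Calendar;
import java.util.TimeZone;

public class DstDemo {
    public static void main(String[] args) {
        // "two days ago" computed as raw milliseconds...
        long twoDaysMs = 2L * 24 * 60 * 60 * 1000;
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("America/New_York"));
        cal.set(2010, Calendar.MARCH, 15, 12, 0, 0);  // just after the US DST switch
        long expectedByTicks = cal.getTimeInMillis() - twoDaysMs;

        // ...versus "two days ago" computed with calendar (wall-clock) arithmetic
        cal.add(Calendar.DAY_OF_MONTH, -2);
        long expectedByCalendar = cal.getTimeInMillis();

        // across the spring-forward boundary the two differ by one hour
        System.out.println((expectedByCalendar - expectedByTicks) / (60 * 1000)
                + " minutes apart");                   // prints "60 minutes apart"
    }
}
```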

On Sun, Mar 14, 2010 at 4:46 PM, Robert Muir rcm...@gmail.com wrote:
 Hey guys,

 I am seeing a test failure for TestEvaluatorBag...

 I wonder if you guys have any ideas; I thought it might be my locale,
 but I changed it and I still hit it consistently.

 Thanks!

 --
 Robert Muir
 rcm...@gmail.com




-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (SOLR-1821) Failing testGetDateFormatEvaluator in TestEvaluatorBag

2010-03-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845144#action_12845144
 ] 

Robert Muir commented on SOLR-1821:
---

Nice, fixes the issue.

Can you commit this? It would help us in our current work to ensure we are not 
breaking tests.


 Failing testGetDateFormatEvaluator in TestEvaluatorBag
 --

 Key: SOLR-1821
 URL: https://issues.apache.org/jira/browse/SOLR-1821
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 1.5
Reporter: Chris Male
 Attachments: SOLR-1821.patch


 On some TimeZones (such as EDT currently), 
 TestEvaluatorBag.testGetDateFormatEvaluator fails with the following error:
 {code:xml}
 org.junit.ComparisonFailure: 
 Expected :2010-03-12 17:15
 Actual   :2010-03-12 18:15
   at org.junit.Assert.assertEquals(Assert.java:96)
   at org.junit.Assert.assertEquals(Assert.java:116)
   at 
 org.apache.solr.handler.dataimport.TestEvaluatorBag.testGetDateFormatEvaluator(TestEvaluatorBag.java:127)
 {code}
 This seems to be due to the reliance on raw system ticks to create the 
 Date to compare against.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (SOLR-1821) Failing testGetDateFormatEvaluator in TestEvaluatorBag

2010-03-14 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned SOLR-1821:
-

Assignee: Robert Muir

 Failing testGetDateFormatEvaluator in TestEvaluatorBag
 --

 Key: SOLR-1821
 URL: https://issues.apache.org/jira/browse/SOLR-1821
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 1.5
Reporter: Chris Male
Assignee: Robert Muir
 Attachments: SOLR-1821.patch


 On some TimeZones (such as EDT currently), 
 TestEvaluatorBag.testGetDateFormatEvaluator fails with the following error:
 {code:xml}
 org.junit.ComparisonFailure: 
 Expected :2010-03-12 17:15
 Actual   :2010-03-12 18:15
   at org.junit.Assert.assertEquals(Assert.java:96)
   at org.junit.Assert.assertEquals(Assert.java:116)
   at 
 org.apache.solr.handler.dataimport.TestEvaluatorBag.testGetDateFormatEvaluator(TestEvaluatorBag.java:127)
 {code}
 This seems to be due to the reliance on raw system ticks to create the 
 Date to compare against.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-1821) Failing testGetDateFormatEvaluator in TestEvaluatorBag

2010-03-14 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved SOLR-1821.
---

   Resolution: Fixed
Fix Version/s: 1.5

Committed revision 922991. 

Thanks Chris!

 Failing testGetDateFormatEvaluator in TestEvaluatorBag
 --

 Key: SOLR-1821
 URL: https://issues.apache.org/jira/browse/SOLR-1821
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 1.5
Reporter: Chris Male
Assignee: Robert Muir
 Fix For: 1.5

 Attachments: SOLR-1821.patch


 On some TimeZones (such as EDT currently), 
 TestEvaluatorBag.testGetDateFormatEvaluator fails with the following error:
 {code:xml}
 org.junit.ComparisonFailure: 
 Expected :2010-03-12 17:15
 Actual   :2010-03-12 18:15
   at org.junit.Assert.assertEquals(Assert.java:96)
   at org.junit.Assert.assertEquals(Assert.java:116)
   at 
 org.apache.solr.handler.dataimport.TestEvaluatorBag.testGetDateFormatEvaluator(TestEvaluatorBag.java:127)
 {code}
 This seems to be due to the reliance on raw system ticks to create the 
 Date to compare against.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API

2010-03-13 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1657:
--

Attachment: SOLR-1657_part2.patch

Here's a separate patch (_part2.patch) for all the remaining tokenstreams.

The only one remaining now is SynonymFilter.

For several areas in this patch, I didn't properly change any APIs to fully
support the new Attributes-based API; I just got them off the deprecated methods,
still working with Token, and left TODOs.

I figure it would be better to hash this out later on separate issues, where
we modify those APIs to really take advantage of an Attributes-based API.


 convert the rest of solr to use the new tokenstream API
 ---

 Key: SOLR-1657
 URL: https://issues.apache.org/jira/browse/SOLR-1657
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, 
 SOLR-1657.patch, SOLR-1657_part2.patch


 org.apache.solr.analysis:
 -BufferedTokenStream-
  - -CommonGramsFilter-
  - -CommonGramsQueryFilter-
  - -RemoveDuplicatesTokenFilter-
 -CapitalizationFilterFactory-
 -HyphenatedWordsFilter-
 -LengthFilter (deprecated, remove)-
 SynonymFilter
 SynonymFilterFactory
 -WordDelimiterFilter-
 org.apache.solr.handler:
 AnalysisRequestHandler
 AnalysisRequestHandlerBase
 org.apache.solr.handler.component:
 QueryElevationComponent
 SpellCheckComponent
 org.apache.solr.highlight:
 DefaultSolrHighlighter
 org.apache.solr.spelling:
 SpellingQueryConverter

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API

2010-03-13 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1657:
--

Description: 
org.apache.solr.analysis:
-BufferedTokenStream-
 - -CommonGramsFilter-
 - -CommonGramsQueryFilter-
 - -RemoveDuplicatesTokenFilter-
-CapitalizationFilterFactory-
-HyphenatedWordsFilter-
-LengthFilter (deprecated, remove)-
SynonymFilter
SynonymFilterFactory
-WordDelimiterFilter-

-org.apache.solr.handler:-
-AnalysisRequestHandler-
-AnalysisRequestHandlerBase-

-org.apache.solr.handler.component:-
-QueryElevationComponent-
-SpellCheckComponent-

-org.apache.solr.highlight:-
-DefaultSolrHighlighter-

-org.apache.solr.spelling:-
-SpellingQueryConverter-


  was:
org.apache.solr.analysis:
-BufferedTokenStream-
 - -CommonGramsFilter-
 - -CommonGramsQueryFilter-
 - -RemoveDuplicatesTokenFilter-
-CapitalizationFilterFactory-
-HyphenatedWordsFilter-
-LengthFilter (deprecated, remove)-
SynonymFilter
SynonymFilterFactory
-WordDelimiterFilter-

org.apache.solr.handler:
AnalysisRequestHandler
AnalysisRequestHandlerBase

org.apache.solr.handler.component:
QueryElevationComponent
SpellCheckComponent

org.apache.solr.highlight:
DefaultSolrHighlighter

org.apache.solr.spelling:
SpellingQueryConverter



 convert the rest of solr to use the new tokenstream API
 ---

 Key: SOLR-1657
 URL: https://issues.apache.org/jira/browse/SOLR-1657
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, 
 SOLR-1657.patch, SOLR-1657_part2.patch


 org.apache.solr.analysis:
 -BufferedTokenStream-
  - -CommonGramsFilter-
  - -CommonGramsQueryFilter-
  - -RemoveDuplicatesTokenFilter-
 -CapitalizationFilterFactory-
 -HyphenatedWordsFilter-
 -LengthFilter (deprecated, remove)-
 SynonymFilter
 SynonymFilterFactory
 -WordDelimiterFilter-
 -org.apache.solr.handler:-
 -AnalysisRequestHandler-
 -AnalysisRequestHandlerBase-
 -org.apache.solr.handler.component:-
 -QueryElevationComponent-
 -SpellCheckComponent-
 -org.apache.solr.highlight:-
 -DefaultSolrHighlighter-
 -org.apache.solr.spelling:-
 -SpellingQueryConverter-

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1813) Support Arabic PDF extraction

2010-03-08 Thread Robert Muir (JIRA)
Support Arabic PDF extraction
-

 Key: SOLR-1813
 URL: https://issues.apache.org/jira/browse/SOLR-1813
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4
Reporter: Robert Muir


Extraction of Arabic text from PDF files is supported by tika/pdfbox, but we 
don't have the optional dependency to do it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1813) Support Arabic PDF extraction

2010-03-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1813:
--

Attachment: SOLR-1813.patch

Attached is a patch with a test case.

I can shrink the icu4j jar file if this is needed.

I will attach the test PDF separately.

 Support Arabic PDF extraction
 -

 Key: SOLR-1813
 URL: https://issues.apache.org/jira/browse/SOLR-1813
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4
Reporter: Robert Muir
 Attachments: arabic.pdf, SOLR-1813.patch


 Extraction of Arabic text from PDF files is supported by tika/pdfbox, but we 
 don't have the optional dependency to do it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1813) Support Arabic PDF extraction

2010-03-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1813:
--

Attachment: arabic.pdf

the pdf file for contrib/extraction/src/test/resources/arabic.pdf

 Support Arabic PDF extraction
 -

 Key: SOLR-1813
 URL: https://issues.apache.org/jira/browse/SOLR-1813
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4
Reporter: Robert Muir
 Attachments: arabic.pdf, SOLR-1813.patch


 Extraction of Arabic text from PDF files is supported by tika/pdfbox, but we 
 don't have the optional dependency to do it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1813) Support Arabic PDF extraction

2010-03-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1813:
--

Attachment: icu4j-4_2_1.jar

the icu4j jar file that goes in contrib/extraction/lib

 Support Arabic PDF extraction
 -

 Key: SOLR-1813
 URL: https://issues.apache.org/jira/browse/SOLR-1813
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4
Reporter: Robert Muir
 Attachments: arabic.pdf, icu4j-4_2_1.jar, SOLR-1813.patch


 Extraction of Arabic text from PDF files is supported by tika/pdfbox, but we 
 don't have the optional dependency to do it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Solr Performance and Scalability

2010-02-11 Thread Robert Muir
Tom, this is really completely unrelated, but given that you have such huge
documents and I see you have exceeded term count limits in Lucene, I can't
help but wonder if you have ever considered Andrzej's index pruning patch?
(It is simply a tool you can run on your index.)

depending upon requirements, seems like it might be a good fit.

http://issues.apache.org/jira/browse/LUCENE-1812

On Thu, Feb 11, 2010 at 3:11 PM, Tom Burton-West tburtonw...@gmail.comwrote:


 The HathiTrust Large Search indexes the OCR from 5 million volumes, with an
 average of 200-300 pages per volume. So the total number of pages indexed
 would be over 1 billion. However, we are not using pages as Solr documents,
 we are using the entire book, so we only have 5 million rather than 1
 billion Solr documents.

 We also are not storing the OCRed text.  Since the total size of the index
 for 5 million volumes is over 2 terabytes, we split the index into 10
 shards, each indexing about 1/2 million documents.

 Given all that, our indexes are about 250-300GB for each 500,000 books.
 About 85% of that is the *prx position index.   Unless you have enough
 memory on the OS to get a significant amount of the index into the disk OS
 cache, disk I/O is the big bottleneck, especially for phrase queries with
 common words.
  See   http://www.hathitrust.org/blogs/large-scale-search
 http://www.hathitrust.org/blogs/large-scale-search  for more details.

 Have you considered storing the OCR separately rather than in the Solr
 index
 or does your use case require storing the OCR in the index?


 Tom Burton-West
 Digital Library Production Service
 University of Michigan



 Wick2804 wrote:
 
  We are thinking of creating a Lucene Solr project to store 50million full
  text OCRed A4 pages. Is there anyone out there who could provide some
 kind
  of guidance on the size of index we are likely to generate, and are there
  any gotchas in the standard analysis engines for load and query that will
  cause us issues. Do large indexes cause memory issues on servers?  Any
  help or advice greatly appreciated.
 

 --
 View this message in context:
 http://old.nabble.com/Solr-Performance-and-Scalability-tp27552013p27553353.html
 Sent from the Solr - Dev mailing list archive at Nabble.com.




-- 
Robert Muir
rcm...@gmail.com


Re: Timeline of upgrade to Lucene 3.0.

2010-02-05 Thread Robert Muir
Hi Dawid,

here is the jira issue where you can track the status of getting off
deprecated 2.9 APIs: https://issues.apache.org/jira/browse/SOLR-1659

On Fri, Feb 5, 2010 at 5:40 AM, Dawid Weiss dawid.we...@gmail.com wrote:

 Hi there,

 Is there any upgrade path to Lucene 3.0 in the plans? I ask because
 the head of Carrot2 is using Lucene 3.0 (and there are certain
 incompatible API signatures, as it turns out). An upgrade to the next
 planned Carrot2 3.2.0 release would bring the benefit of not having to
 download external JARs (we replaced all the LGPL code and persuaded
 simple-xml author to switch to the Apache license).

 Corresponding Carrot2 issue for this is here:
 http://issues.carrot2.org/browse/CARROT-623

 Dawid




-- 
Robert Muir
rcm...@gmail.com


[jira] Created: (SOLR-1760) convert synonymsfilter to new tokenstream API

2010-02-05 Thread Robert Muir (JIRA)
convert synonymsfilter to new tokenstream API
-

 Key: SOLR-1760
 URL: https://issues.apache.org/jira/browse/SOLR-1760
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Reporter: Robert Muir


This is the other non-trivial tokenstream to convert to the new API. I looked at 
this again today, and think I have a design where it will be nice and efficient.

If you have ideas or are already looking at it, please comment!! I haven't 
started coding and we shouldn't duplicate any efforts.

here is my current design:

* add a variable 'maximumContext' to SynonymMap. This is simply the maximum 
singleMatch.size(); it's the maximum number of tokens of lookahead that is ever 
needed.
* save/restoreState/cloning can be minimized by using a stack (fixed array of 
maximumContext) of references to the SynonymMap submaps. This way we can 
backtrack efficiently for multiword matches without save/restoreState and with 
fewer comparisons.
* two queues (can be fixed arrays of maximumContext) are still needed for 
placing state objects: the first holds those that have been evaluated (always 
empty in the case of !preserveOriginal), and the second holds those that haven't 
yet been evaluated but are queued due to lookahead. 

I plan on coding this up soon; if you have a better idea or have started work, 
please comment.
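To make the lookahead idea concrete, here is a toy sketch of greedy longest-match multiword replacement with bounded lookahead. This is only my illustration of the maximumContext bound, not the actual SynonymFilter design: it ignores attributes, preserveOriginal, and the state queues described above.

```java
// Toy sketch: replace the longest matching run of tokens, looking ahead
// at most maximumContext tokens (the longest left-hand side in the map).
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SynonymLookaheadSketch {
    static List<String> rewrite(List<String> input,
                                Map<List<String>, String> synonyms,
                                int maximumContext) {
        List<String> output = new ArrayList<>();
        int i = 0;
        while (i < input.size()) {
            String replacement = null;
            int matched = 0;
            // try the longest lookahead window first
            for (int len = Math.min(maximumContext, input.size() - i); len >= 1; len--) {
                String r = synonyms.get(input.subList(i, i + len));
                if (r != null) { replacement = r; matched = len; break; }
            }
            if (replacement != null) { output.add(replacement); i += matched; }
            else { output.add(input.get(i)); i++; }
        }
        return output;
    }

    public static void main(String[] args) {
        // example map: "united states" -> "usa"; maximumContext is therefore 2
        Map<List<String>, String> synonyms =
            Map.of(List.of("united", "states"), "usa");
        System.out.println(rewrite(
            List.of("the", "united", "states", "of", "america"), synonyms, 2));
        // prints [the, usa, of, america]
    }
}
```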


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API

2010-02-04 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1657:
--

Attachment: SOLR-1657.patch

Chris's patch, except it also implements BufferedTokenStream. It's marked 
deprecated, its API cannot support custom attributes (so the six are simply 
copied into Tokens and back), and it's unused in Solr with this patch.

 convert the rest of solr to use the new tokenstream API
 ---

 Key: SOLR-1657
 URL: https://issues.apache.org/jira/browse/SOLR-1657
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, 
 SOLR-1657.patch


 org.apache.solr.analysis:
 BufferedTokenStream
  - -CommonGramsFilter-
  - -CommonGramsQueryFilter-
  - -RemoveDuplicatesTokenFilter-
 -CapitalizationFilterFactory-
 -HyphenatedWordsFilter-
 -LengthFilter (deprecated, remove)-
 SynonymFilter
 SynonymFilterFactory
 -WordDelimiterFilter-
 org.apache.solr.handler:
 AnalysisRequestHandler
 AnalysisRequestHandlerBase
 org.apache.solr.handler.component:
 QueryElevationComponent
 SpellCheckComponent
 org.apache.solr.highlight:
 DefaultSolrHighlighter
 org.apache.solr.search:
 FieldQParserPlugin
 org.apache.solr.spelling:
 SpellingQueryConverter

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API

2010-02-04 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1657:
--

Description: 
org.apache.solr.analysis:
-BufferedTokenStream-
 - -CommonGramsFilter-
 - -CommonGramsQueryFilter-
 - -RemoveDuplicatesTokenFilter-
-CapitalizationFilterFactory-
-HyphenatedWordsFilter-
-LengthFilter (deprecated, remove)-
SynonymFilter
SynonymFilterFactory
-WordDelimiterFilter-

org.apache.solr.handler:
AnalysisRequestHandler
AnalysisRequestHandlerBase

org.apache.solr.handler.component:
QueryElevationComponent
SpellCheckComponent

org.apache.solr.highlight:
DefaultSolrHighlighter

org.apache.solr.spelling:
SpellingQueryConverter


  was:
org.apache.solr.analysis:
BufferedTokenStream
 - -CommonGramsFilter-
 - -CommonGramsQueryFilter-
 - -RemoveDuplicatesTokenFilter-
-CapitalizationFilterFactory-
-HyphenatedWordsFilter-
-LengthFilter (deprecated, remove)-
SynonymFilter
SynonymFilterFactory
-WordDelimiterFilter-

org.apache.solr.handler:
AnalysisRequestHandler
AnalysisRequestHandlerBase

org.apache.solr.handler.component:
QueryElevationComponent
SpellCheckComponent

org.apache.solr.highlight:
DefaultSolrHighlighter

org.apache.solr.search:
FieldQParserPlugin

org.apache.solr.spelling:
SpellingQueryConverter



 convert the rest of solr to use the new tokenstream API
 ---

 Key: SOLR-1657
 URL: https://issues.apache.org/jira/browse/SOLR-1657
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, 
 SOLR-1657.patch


 org.apache.solr.analysis:
 -BufferedTokenStream-
  - -CommonGramsFilter-
  - -CommonGramsQueryFilter-
  - -RemoveDuplicatesTokenFilter-
 -CapitalizationFilterFactory-
 -HyphenatedWordsFilter-
 -LengthFilter (deprecated, remove)-
 SynonymFilter
 SynonymFilterFactory
 -WordDelimiterFilter-
 org.apache.solr.handler:
 AnalysisRequestHandler
 AnalysisRequestHandlerBase
 org.apache.solr.handler.component:
 QueryElevationComponent
 SpellCheckComponent
 org.apache.solr.highlight:
 DefaultSolrHighlighter
 org.apache.solr.spelling:
 SpellingQueryConverter

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

2010-02-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829092#action_12829092
 ] 

Robert Muir commented on SOLR-1670:
---

bq. Order of overlapping tokens is unimportant in every TokenFilter used in 
Solr that I know about. Order-sensitivity is the exception, no?

I guess all along my problem is that tokenstreams are ordered by definition. If 
this order does not matter, a test that uses actual queries would make more 
sense.

The problem was that the test constructions previously used by this filter were 
used in other places where they really shouldn't have been, and the laxness hid 
real bugs (such as this very issue itself!).
This is all I am trying to avoid. There is nothing wrong with Steven's 
patch/test construction; I am just trying to err on the side of caution.


 synonymfilter/map repeat bug
 

 Key: SOLR-1670
 URL: https://issues.apache.org/jira/browse/SOLR-1670
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis
Affects Versions: 1.4
Reporter: Robert Muir
Assignee: Yonik Seeley
 Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch


 As part of converting tests for SOLR-1657, I ran into a problem with 
 SynonymFilter.
 The test for 'repeats' has a flaw: it uses an assertTokEqual construct 
 which does not really validate that two lists of tokens are equal; it just 
 stops at the shorter one.
 {code}
 // repeats
 map.add(strings("a b"), tokens("ab"), orig, merge);
 map.add(strings("a b"), tokens("ab"), orig, merge);
 assertTokEqual(getTokList(map, "a b", false), tokens("ab"));
 /* in reality the result from getTokList is "ab ab ab"! */
 {code}
 When converted to assertTokenStreamContents this problem surfaced. Attached 
 is an additional assertion to the existing testcase.
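The flaw described in this issue can be demonstrated without any Lucene/Solr code at all. The sketch below is not Solr's actual helper; it just contrasts a lax comparison that stops at the shorter list (the assertTokEqual style) with a strict whole-list comparison (the assertTokenStreamContents style):

```java
import java.util.*;

// Why a compare-up-to-the-shorter-list check hides bugs: extra tokens
// emitted past the expected list are never examined.
public class LaxAssertDemo {
    // lax check: only compares the common prefix of the two lists
    static boolean laxEqual(List<String> a, List<String> b) {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            if (!a.get(i).equals(b.get(i))) return false;
        }
        return true;
    }

    // strict check: lengths and all elements must match
    static boolean strictEqual(List<String> a, List<String> b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        List<String> actual = List.of("ab", "ab", "ab"); // duplicated synonym output
        List<String> expected = List.of("ab");
        System.out.println(laxEqual(actual, expected));    // true  -> bug hidden
        System.out.println(strictEqual(actual, expected)); // false -> bug surfaced
    }
}
```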

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

2010-02-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829097#action_12829097
 ] 

Robert Muir commented on SOLR-1670:
---

bq. Not at the semantic level (for overlapping tokens).

Another way to look at it is that a tokenstream is just a sequence of tokens, 
and posInc is just another attribute.

Your description of the semantics makes sense in terms of how it is used by the 
indexer, but the order of these tokens can matter if someone uses a custom 
tokenfilter; it might matter for some custom attributes, and it might matter 
for a different consumer; it's different behavior. I have made an effort to 
preserve all the behavior of all these tokenstreams when converting to the new 
API. I really don't want to break anything.
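The "posInc is just another attribute" view can be made concrete in plain Java (this is not the Lucene attribute API, just a sketch): a stream is an ordered list of (term, posInc) pairs, and absolute positions are the running sum of increments. Two streams can produce identical positions yet still be different sequences, which is exactly why order can matter to a downstream consumer.

```java
import java.util.*;

// A token stream as an ordered sequence; position = running sum of posInc.
public class PosIncDemo {
    record Tok(String term, int posInc) {}

    // compute each term's absolute position from the increments
    static Map<String, Integer> positions(List<Tok> stream) {
        Map<String, Integer> out = new LinkedHashMap<>();
        int pos = -1;
        for (Tok t : stream) {
            pos += t.posInc();
            out.put(t.term(), pos);
        }
        return out;
    }

    public static void main(String[] args) {
        // same two terms stacked at position 0, in two different orders
        List<Tok> a = List.of(new Tok("wifi", 1), new Tok("wi-fi", 0));
        List<Tok> b = List.of(new Tok("wi-fi", 1), new Tok("wifi", 0));
        System.out.println(positions(a)); // {wifi=0, wi-fi=0}
        System.out.println(positions(b)); // {wi-fi=0, wifi=0}
        System.out.println(a.equals(b));  // false: same positions, different streams
    }
}
```

The indexer only sees the positions, but any consumer that walks the sequence token by token sees the difference.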


 synonymfilter/map repeat bug
 

 Key: SOLR-1670
 URL: https://issues.apache.org/jira/browse/SOLR-1670
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis
Affects Versions: 1.4
Reporter: Robert Muir
Assignee: Yonik Seeley
 Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch


 As part of converting tests for SOLR-1657, I ran into a problem with 
 SynonymFilter.
 The test for 'repeats' has a flaw: it uses an assertTokEqual construct 
 which does not really validate that two lists of tokens are equal; it just 
 stops at the shorter one.
 {code}
 // repeats
 map.add(strings("a b"), tokens("ab"), orig, merge);
 map.add(strings("a b"), tokens("ab"), orig, merge);
 assertTokEqual(getTokList(map, "a b", false), tokens("ab"));
 /* in reality the result from getTokList is "ab ab ab"! */
 {code}
 When converted to assertTokenStreamContents this problem surfaced. Attached 
 is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

2010-02-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828204#action_12828204
 ] 

Robert Muir commented on SOLR-1670:
---

bq. I left in place the existing test method, which requires the specified 
order.

Is it possible to only expose the 'unsorted' one to the synonyms test (such as 
in the synonyms test file itself, rather than the base token stream test case)?

I can't think of another situation where it would make sense; it is more likely 
to be abused instead.

 synonymfilter/map repeat bug
 

 Key: SOLR-1670
 URL: https://issues.apache.org/jira/browse/SOLR-1670
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis
Affects Versions: 1.4
Reporter: Robert Muir
Assignee: Yonik Seeley
 Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch


 As part of converting tests for SOLR-1657, I ran into a problem with 
 SynonymFilter.
 The test for 'repeats' has a flaw: it uses an assertTokEqual construct 
 which does not really validate that two lists of tokens are equal; it just 
 stops at the shorter one.
 {code}
 // repeats
 map.add(strings("a b"), tokens("ab"), orig, merge);
 map.add(strings("a b"), tokens("ab"), orig, merge);
 assertTokEqual(getTokList(map, "a b", false), tokens("ab"));
 /* in reality the result from getTokList is "ab ab ab"! */
 {code}
 When converted to assertTokenStreamContents this problem surfaced. Attached 
 is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug

2010-01-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806833#action_12806833
 ] 

Robert Muir commented on SOLR-1670:
---

Steven, I don't have a problem with your patch (I do not wish to be in the way 
of anyone trying to work on SynonymFilter).

But I want to explain some of where I was coming from.

The main reason I got myself into this mess was to try to add wordnet support 
to Solr. However, this is currently not possible without duplicating a lot of 
code.
We need to be really careful about allowing any order; it does matter in some 
situations.
For example, Lucene's synonymfilter (with wordnet support) has an option 
to limit the number of expansions (so it's like a top-N synonym expansion).
Solr doesn't currently have this, so it's N/A for now, but it is an example where 
the order suddenly becomes important.

Only slightly related: we added some improvements to this assertion in Lucene 
recently and found a lot of bugs (better checking for clearAttribute() and end()).
At some point I would like to port these test improvements over to Solr, too. 


 synonymfilter/map repeat bug
 

 Key: SOLR-1670
 URL: https://issues.apache.org/jira/browse/SOLR-1670
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis
Affects Versions: 1.4
Reporter: Robert Muir
Assignee: Yonik Seeley
 Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch


 As part of converting tests for SOLR-1657, I ran into a problem with 
 SynonymFilter.
 The test for 'repeats' has a flaw: it uses an assertTokEqual construct 
 which does not really validate that two lists of tokens are equal; it just 
 stops at the shorter one.
 {code}
 // repeats
 map.add(strings("a b"), tokens("ab"), orig, merge);
 map.add(strings("a b"), tokens("ab"), orig, merge);
 assertTokEqual(getTokList(map, "a b", false), tokens("ab"));
 /* in reality the result from getTokList is "ab ab ab"! */
 {code}
 When converted to assertTokenStreamContents this problem surfaced. Attached 
 is an additional assertion to the existing testcase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805187#action_12805187
 ] 

Robert Muir commented on SOLR-1677:
---

bq. 2) Perhaps you should read the StopFilter example i already posted in my 
last comment...

https://issues.apache.org/jira/browse/LUCENE-2094?focusedCommentId=12783932page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12783932

As far as this one goes, I specifically commented before on this not being 
'hidden' by Version (with Solr users in mind), but instead being its own option 
that every user should consider, regardless of defaults.

For the stopfilter posInc, the user should think it through: it's pretty strange, 
as I mention in my comment, that a definite article like 'the' gets a posInc 
bump in one language but not another, simply because it happens to be separated 
by a space.

I guess I couldn't care less what the default is; if you care about such things, 
you shouldn't be using the defaults but instead specifying this yourself in the 
schema, and then Version has no effect. I can't really defend the whole stopfilter 
posInc thing, as again I think it doesn't make a whole lot of sense; maybe it 
works well for English, I guess, but I won't argue about it.
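The posInc behavior under discussion can be sketched in plain Java. This is not Solr's StopFilter, just an illustration of the general technique: when position increments are enabled, removed stopwords leave a "hole" by adding their increments onto the next surviving token.

```java
import java.util.*;

// Minimal stop-filter sketch: dropped words either vanish silently or
// bump the posInc of the next surviving token, depending on the flag.
public class StopPosIncSketch {
    record Tok(String term, int posInc) {}

    static List<Tok> stopFilter(List<Tok> in, Set<String> stops, boolean enablePosInc) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0; // increments accumulated from removed tokens
        for (Tok t : in) {
            if (stops.contains(t.term())) {
                skipped += t.posInc();
                continue;
            }
            out.add(new Tok(t.term(), enablePosInc ? t.posInc() + skipped : t.posInc()));
            skipped = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> in = List.of(new Tok("the", 1), new Tok("quick", 1), new Tok("fox", 1));
        // with increments enabled, "quick" gets posInc 2 (a hole where "the" was)
        System.out.println(stopFilter(in, Set.of("the"), true));
        // with increments disabled, "quick" keeps posInc 1 (no hole)
        System.out.println(stopFilter(in, Set.of("the"), false));
    }
}
```

The language-dependence complaint follows directly: whether a word gets its own increment (and thus leaves a hole) depends entirely on whether it was tokenized as a separate token in the first place.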


 Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
 BaseTokenFilterFactory
 ---

 Key: SOLR-1677
 URL: https://issues.apache.org/jira/browse/SOLR-1677
 Project: Solr
  Issue Type: Sub-task
  Components: Schema and Analysis
Reporter: Uwe Schindler
 Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, 
 SOLR-1677.patch


 Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
 compatibility with old indexes created using older versions of Lucene. The 
 most important example is StandardTokenizer, which changed its behaviour with 
 posIncr and incorrect host token types in 2.4 and also in 2.9.
 In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
 much more Unicode support, almost every Tokenizer/TokenFilter needs this 
 Version parameter. In 2.9, the deprecated old ctors without Version take 
 LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
 This patch adds basic support for the Lucene Version property to the base 
 factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
 contains a helper map to decode the version strings, but in 3.0 it can be 
 replaced by Version.valueOf(String), as the Version is a subclass of Java5 
 enums. The default value is Version.LUCENE_24 (as this is the default for the 
 no-version ctors in Lucene).
 This patch also removes unneeded conversions to CharArraySet from 
 StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
 to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-01-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802979#action_12802979
 ] 

Robert Muir commented on SOLR-1677:
---

bq. The point I was trying to make is that the types of bug fixes we make in 
Lucene are no mathematical absolutes - we're not fixing bugs where 1+1=3.

You are wrong; they are absolutes.
And here are the JIRA issues for stemming bugs, since you didn't take my hint to 
go and actually read them.

LUCENE-2055: I used the snowball tests against these stemmers, which claim to 
implement the 'snowball algorithm', and they fail. This is an absolute, and the 
fix is to instead use snowball.
LUCENE-2203: I used the snowball tests against these stemmers and they failed. 
Here is Martin Porter's confirmation that these are bugs: 
http://article.gmane.org/gmane.comp.search.snowball/1139

Perhaps you should come up with a better example than stemming, as you don't 
know what you are talking about.  

 Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
 BaseTokenFilterFactory
 ---

 Key: SOLR-1677
 URL: https://issues.apache.org/jira/browse/SOLR-1677
 Project: Solr
  Issue Type: Sub-task
  Components: Schema and Analysis
Reporter: Uwe Schindler
 Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, 
 SOLR-1677.patch


 Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
 compatibility with old indexes created using older versions of Lucene. The 
 most important example is StandardTokenizer, which changed its behaviour with 
 posIncr and incorrect host token types in 2.4 and also in 2.9.
 In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
 much more Unicode support, almost every Tokenizer/TokenFilter needs this 
 Version parameter. In 2.9, the deprecated old ctors without Version take 
 LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
 This patch adds basic support for the Lucene Version property to the base 
 factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
 contains a helper map to decode the version strings, but in 3.0 it can be 
 replaced by Version.valueOf(String), as the Version is a subclass of Java5 
 enums. The default value is Version.LUCENE_24 (as this is the default for the 
 no-version ctors in Lucene).
 This patch also removes unneeded conversions to CharArraySet from 
 StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
 to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


