Re: GIT does not support empty directories
Seriously? We should hack our ant files around the bugs in every crappy source control system that comes out? Fix Git.

On Thu, Apr 15, 2010 at 10:55 PM, Smiley, David W. dsmi...@mitre.org wrote:
> I've run into this too. I don't think this needs to be documented, I think it needs to be *fixed* -- that is, the relevant ant tasks need to not assume these directories exist and create them if not.
> ~ David Smiley
>
> -----Original Message-----
> From: Lance Norskog [mailto:goks...@gmail.com]
> Sent: Wednesday, April 14, 2010 11:14 PM
> To: solr-dev
> Subject: GIT does not support empty directories
>
> There are some empty directories in the Solr source tree, both in 1.4 and the trunk:
>   example/work
>   example/webapp
>   example/logs
> Git does not support empty directories:
> https://git.wiki.kernel.org/index.php/GitFaq#Can_I_add_empty_directories.3F
> And so, when you check out from the Apache GIT repository, these empty directories do not appear and 'ant example' and 'ant run-example' fail.
> There is no 'how to use the solr git stuff' wiki page; that seems like the right place to document this. I'm not git-smart enough to write that page.
>
> --
> Lance Norskog
> goks...@gmail.com

--
Robert Muir
rcm...@gmail.com
Re: GIT does not support empty directories
I don't like the idea of complicating lucene/solr's build system any more than it already is, unless it's absolutely necessary. It's already too complicated. Instead of adding more hacks, what is actually broken (git) is what should be fixed, as the link states: "Currently the design of the git index (staging area) only permits *files* to be listed, and nobody competent enough to make the change to allow empty directories has cared enough about this situation to remedy it."

On Fri, Apr 16, 2010 at 11:14 AM, Smiley, David W. dsmi...@mitre.org wrote:
> Seriously. I sympathize with your point that git should support empty directories. But as a practical matter, it's easy to make the ant build tolerant of them.
> ~ David Smiley
>
> From: Robert Muir [rcm...@gmail.com]
> Sent: Friday, April 16, 2010 6:53 AM
> To: solr-dev@lucene.apache.org
> Subject: Re: GIT does not support empty directories
>
> Seriously? We should hack our ant files around the bugs in every crappy source control system that comes out? Fix Git.
>
> On Thu, Apr 15, 2010 at 10:55 PM, Smiley, David W. dsmi...@mitre.org wrote:
>> I've run into this too. I don't think this needs to be documented, I think it needs to be *fixed* -- that is, the relevant ant tasks need to not assume these directories exist and create them if not.
>> ~ David Smiley
>>
>> -----Original Message-----
>> From: Lance Norskog [mailto:goks...@gmail.com]
>> Sent: Wednesday, April 14, 2010 11:14 PM
>> To: solr-dev
>> Subject: GIT does not support empty directories
>>
>> There are some empty directories in the Solr source tree, both in 1.4 and the trunk:
>>   example/work
>>   example/webapp
>>   example/logs
>> Git does not support empty directories:
>> https://git.wiki.kernel.org/index.php/GitFaq#Can_I_add_empty_directories.3F
>> And so, when you check out from the Apache GIT repository, these empty directories do not appear and 'ant example' and 'ant run-example' fail.
>> There is no 'how to use the solr git stuff' wiki page; that seems like the right place to document this. I'm not git-smart enough to write that page.
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>
> --
> Robert Muir
> rcm...@gmail.com

--
Robert Muir
rcm...@gmail.com
Re: Eclipse project files...
On Mon, Apr 12, 2010 at 5:15 AM, Paolo Castagna castagna.li...@googlemail.com wrote:
> For Lucene, I needed two more jars from the Ant project:
> - ant-1.7.1.jar
> - ant-junit-1.7.1.jar

Paolo, I put these in the lib directory now, to hopefully make IDE configuration easier. By the way, thanks for your ideas here. I think it's worth our time to try to make Lucene/Solr as easy as possible for someone to bring up in their IDE, or we scare people away...

--
Robert Muir
rcm...@gmail.com
[jira] Assigned: (SOLR-1876) Convert all tokenstreams and tests to use CharTermAttribute
[ https://issues.apache.org/jira/browse/SOLR-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned SOLR-1876: - Assignee: Robert Muir Convert all tokenstreams and tests to use CharTermAttribute --- Key: SOLR-1876 URL: https://issues.apache.org/jira/browse/SOLR-1876 Project: Solr Issue Type: Task Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: SOLR-1876.patch See the improvements in LUCENE-2302. TermAttribute has been deprecated for flexible indexing, as terms can really be anything, as long as they can be serialized to byte[]. For character-terms, a CharTermAttribute has been created, with a more friendly API. Additionally this attribute implements the CharSequence and Appendable interfaces. We should convert all Solr tokenstreams to use this new attribute. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Assigned: (SOLR-1874) optimize patternreplacefilter
[ https://issues.apache.org/jira/browse/SOLR-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned SOLR-1874: - Assignee: Robert Muir optimize patternreplacefilter - Key: SOLR-1874 URL: https://issues.apache.org/jira/browse/SOLR-1874 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: SOLR-1874.patch We can optimize PatternReplaceFilter: * don't need to create Strings since CharTermAttribute implements CharSequence, just match directly against it. * reuse the matcher, since CharTermAttribute is reused, too. * don't create Strings/waste time in replaceAll/replaceFirst if the term doesn't match the regex at all... check with find() first. There is more that could be done to make it faster for terms that do match, but this is simple and a start. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Resolved: (SOLR-1874) optimize patternreplacefilter
[ https://issues.apache.org/jira/browse/SOLR-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1874. --- Resolution: Fixed Committed revision 932752. optimize patternreplacefilter - Key: SOLR-1874 URL: https://issues.apache.org/jira/browse/SOLR-1874 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: SOLR-1874.patch We can optimize PatternReplaceFilter: * don't need to create Strings since CharTermAttribute implements CharSequence, just match directly against it. * reuse the matcher, since CharTermAttribute is reused, too. * don't create Strings/waste time in replaceAll/replaceFirst if the term doesn't match the regex at all... check with find() first. There is more that could be done to make it faster for terms that do match, but this is simple and a start. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
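The three optimizations listed in SOLR-1874 can be illustrated with plain java.util.regex, independent of Lucene; this is a hedged sketch, not Solr's actual PatternReplaceFilter source (which is not shown in this thread), and the class and method names are invented. It reuses a single Matcher via reset(CharSequence) on each term (CharTermAttribute implements CharSequence, so no String copy is needed) and checks find() before paying for replaceAll:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternReplaceSketch {
    // Compile the pattern once and reuse one Matcher for every term,
    // mirroring the "reuse the matcher" point in the issue.
    // (A real filter would hold one matcher per TokenStream instance;
    // a static matcher like this is not thread-safe.)
    private static final Pattern DOTS = Pattern.compile("\\.+");
    private static final Matcher MATCHER = DOTS.matcher("");

    static String replaceAll(CharSequence term, String replacement) {
        MATCHER.reset(term);          // match the CharSequence directly, no new String
        if (!MATCHER.find()) {
            return term.toString();   // fast path: term doesn't match the regex at all
        }
        // Matcher.replaceAll resets the matcher internally before scanning.
        return MATCHER.replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(replaceAll("identi.ca", "-"));  // identi-ca
        System.out.println(replaceAll("identica", "-"));   // identica
    }
}
```

The find()-first check is the same trick the issue describes: terms that don't match never allocate a replacement String at all.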
[jira] Created: (SOLR-1876) Convert all tokenstreams and tests to use CharTermAttribute
Convert all tokenstreams and tests to use CharTermAttribute --- Key: SOLR-1876 URL: https://issues.apache.org/jira/browse/SOLR-1876 Project: Solr Issue Type: Task Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Fix For: 3.1 See the improvements in LUCENE-2302. TermAttribute has been deprecated for flexible indexing, as terms can really be anything, as long as they can be serialized to byte[]. For character-terms, a CharTermAttribute has been created, with a more friendly API. Additionally this attribute implements the CharSequence and Appendable interfaces. We should convert all Solr tokenstreams to use this new attribute. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (SOLR-1876) Convert all tokenstreams and tests to use CharTermAttribute
[ https://issues.apache.org/jira/browse/SOLR-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1876: -- Attachment: SOLR-1876.patch This patch does the following: * Converts all tokenstreams to use CharTermAttribute * Makes all non-final concrete TokenStreams and Analyzers final (see LUCENE-2389) * enables both lucene and solr assertions when running solr core and contrib tests (previously disabled!) All tests pass, and also pass with the additional assertions if you apply LUCENE-2389 Convert all tokenstreams and tests to use CharTermAttribute --- Key: SOLR-1876 URL: https://issues.apache.org/jira/browse/SOLR-1876 Project: Solr Issue Type: Task Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Fix For: 3.1 Attachments: SOLR-1876.patch See the improvements in LUCENE-2302. TermAttribute has been deprecated for flexible indexing, as terms can really be anything, as long as they can be serialized to byte[]. For character-terms, a CharTermAttribute has been created, with a more friendly API. Additionally this attribute implements the CharSequence and Appendable interfaces. We should convert all Solr tokenstreams to use this new attribute. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (SOLR-1874) optimize patternreplacefilter
optimize patternreplacefilter - Key: SOLR-1874 URL: https://issues.apache.org/jira/browse/SOLR-1874 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Fix For: 3.1 Attachments: SOLR-1874.patch We can optimize PatternReplaceFilter: * don't need to create Strings since CharTermAttribute implements CharSequence, just match directly against it. * reuse the matcher, since CharTermAttribute is reused, too. * don't create Strings/waste time in replaceAll/replaceFirst if the term doesn't match the regex at all... check with find() first. There is more that could be done to make it faster for terms that do match, but this is simple and a start. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1874) optimize patternreplacefilter
[ https://issues.apache.org/jira/browse/SOLR-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1874: -- Attachment: SOLR-1874.patch optimize patternreplacefilter - Key: SOLR-1874 URL: https://issues.apache.org/jira/browse/SOLR-1874 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Fix For: 3.1 Attachments: SOLR-1874.patch We can optimize PatternReplaceFilter: * don't need to create Strings since CharTermAttribute implements CharSequence, just match directly against it. * reuse the matcher, since CharTermAttribute is reused, too. * don't create Strings/waste time in replaceAll/replaceFirst if the term doesn't match the regex at all... check with find() first. There is more that could be done to make it faster for terms that do match, but this is simple and a start. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1869) RemoveDuplicatesTokenFilter doesn't have expected behaviour
[ https://issues.apache.org/jira/browse/SOLR-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854983#action_12854983 ]

Robert Muir commented on SOLR-1869:
-----------------------------------

bq. this all started because the highlighter was highlighting a term at the same offsets twice,

Perhaps we should fix this directly in DefaultSolrHighlighter? It already has this TokenStream-sorting filter that's intended to do the following:

{code}
/** Orders Tokens in a window first by their startOffset ascending.
 *  endOffset is currently ignored.
 *  This is meant to work around fickleness in the highlighter only. It
 *  can mess up token positions and should not be used for indexing or querying.
 */
{code}

Maybe the deduplication logic should occur here after it sorts on startOffset?

> RemoveDuplicatesTokenFilter doesn't have expected behaviour
> -----------------------------------------------------------
>
> Key: SOLR-1869
> URL: https://issues.apache.org/jira/browse/SOLR-1869
> Project: Solr
> Issue Type: New Feature
> Components: Schema and Analysis
> Reporter: Joe Calderon
> Priority: Minor
> Attachments: RemoveDupOffsetTokenFilter.java, RemoveDupOffsetTokenFilterFactory.java, SOLR-1869.patch
>
> The RemoveDuplicatesTokenFilter seems broken, as it initializes its map and attributes at the class level and not within its constructor.
> In addition, I would think the expected behaviour would be to remove identical terms with the same offset positions; instead it looks like it removes duplicates based on position increment, which won't work when using it after something like the edgengram filter. When I posted this to the mailing list, even Erik Hatcher seemed to think that's what this filter was supposed to do...
> Attaching a patch that has the expected behaviour and initializes variables in the constructor.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1869) RemoveDuplicatesTokenFilter doesn't have expected behaviour
[ https://issues.apache.org/jira/browse/SOLR-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854676#action_12854676 ]

Robert Muir commented on SOLR-1869:
-----------------------------------

Joe, the initialization is the same. I simply prefer to do this right where the attribute is declared, rather than doing it in the ctor (it's the same in Java!). So this is no problem.

As far as the behavior, the filter is currently correct:

{noformat}
A TokenFilter which filters out Tokens at the same position and Term text as the previous token in the stream.
{noformat}

If you want to instead create a filter that removes duplicates across an entire field, this is really a completely different filter, but it sounds like a useful completely different filter! Can you instead create a patch for a separate filter with a different name? I think you can start with this patch, but there are a number of issues with it:
* the map/set is never cleared, so it won't work across reusable tokenstreams. The map/set should be cleared in reset()
* I would use CharArraySet instead of this map, like the current RemoveDuplicatesTokenFilter

> RemoveDuplicatesTokenFilter doesn't have expected behaviour
> -----------------------------------------------------------
>
> Key: SOLR-1869
> URL: https://issues.apache.org/jira/browse/SOLR-1869
> Project: Solr
> Issue Type: Bug
> Components: Schema and Analysis
> Reporter: Joe Calderon
> Priority: Minor
> Attachments: SOLR-1869.patch
>
> The RemoveDuplicatesTokenFilter seems broken, as it initializes its map and attributes at the class level and not within its constructor.
> In addition, I would think the expected behaviour would be to remove identical terms with the same offset positions; instead it looks like it removes duplicates based on position increment, which won't work when using it after something like the edgengram filter. When I posted this to the mailing list, even Erik Hatcher seemed to think that's what this filter was supposed to do...
> Attaching a patch that has the expected behaviour and initializes variables in the constructor.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
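The two review points above (state must be cleared in reset(), dedupe by term text plus offset) can be sketched in plain Java. This is an illustrative stand-in, not the real Lucene TokenFilter API; the class name FieldDedup and its methods are invented:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of "remove duplicate (term, startOffset) pairs across a
// whole field" -- not a real Lucene TokenFilter.
public class FieldDedup {
    private final Set<String> seen = new HashSet<>();

    // Plays the role of TokenFilter.reset(): without clearing here, the
    // instance would silently drop tokens when the stream is reused.
    public void reset() {
        seen.clear();
    }

    // Returns true the first time a (term, startOffset) pair is seen,
    // false for every duplicate after that.
    public boolean accept(String term, int startOffset) {
        return seen.add(startOffset + ":" + term);
    }
}
```

The reset() method is exactly the bug called out in the review: a reused stream must start with an empty set, or tokens from the previous document get treated as duplicates.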
[jira] Updated: (SOLR-1869) RemoveDuplicatesTokenFilter doesn't have expected behaviour
[ https://issues.apache.org/jira/browse/SOLR-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1869: -- Issue Type: New Feature (was: Bug) RemoveDuplicatesTokenFilter doest have expected behaviour - Key: SOLR-1869 URL: https://issues.apache.org/jira/browse/SOLR-1869 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Joe Calderon Priority: Minor Attachments: SOLR-1869.patch the RemoveDuplicatesTokenFilter seems broken as it initializes its map and attributes at the class level and not within its constructor in addition i would think the expected behaviour would be to remove identical terms with the same offset positions, instead it looks like it removes duplicates based on position increment which wont work when using it after something like the edgengram filter. when i posted this to the mailing list even erik hatcher seemed to think thats what this filter was supposed to do... attaching a patch that has the expected behaviour and initializes variables in constructor -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1869) RemoveDuplicatesTokenFilter doesn't have expected behaviour
[ https://issues.apache.org/jira/browse/SOLR-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854712#action_12854712 ] Robert Muir commented on SOLR-1869: --- bq. The CharArrayMap is more performant in lookup, but you are right, we may need posincr. we don't need it for the current implementation, as we clear() the chararrayset when we encounter a term of posincr 0. so the set is only a set of seen terms at some position. RemoveDuplicatesTokenFilter doest have expected behaviour - Key: SOLR-1869 URL: https://issues.apache.org/jira/browse/SOLR-1869 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Joe Calderon Priority: Minor Attachments: SOLR-1869.patch the RemoveDuplicatesTokenFilter seems broken as it initializes its map and attributes at the class level and not within its constructor in addition i would think the expected behaviour would be to remove identical terms with the same offset positions, instead it looks like it removes duplicates based on position increment which wont work when using it after something like the edgengram filter. when i posted this to the mailing list even erik hatcher seemed to think thats what this filter was supposed to do... attaching a patch that has the expected behaviour and initializes variables in constructor -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1865) ignore byte-order markers in SolrResourceLoader
ignore byte-order markers in SolrResourceLoader --- Key: SOLR-1865 URL: https://issues.apache.org/jira/browse/SOLR-1865 Project: Solr Issue Type: Improvement Reporter: Robert Muir Priority: Minor Fix For: 3.1 Attachments: SOLR-1865.patch If you create say a stopwords list with windows notepad or other editors and save as UTF-8, some of these editors will insert a byte-order marker (zero-width no-break space) as the first character of the file. http://www.lucidimagination.com/search/document/5101871231fc95af/is_this_a_bug_of_the_ressourceloader -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1865) ignore byte-order markers in SolrResourceLoader
[ https://issues.apache.org/jira/browse/SOLR-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1865: -- Attachment: SOLR-1865.patch attached is a patch to ignore BOM's at the beginning of files loaded with getLines() ignore byte-order markers in SolrResourceLoader --- Key: SOLR-1865 URL: https://issues.apache.org/jira/browse/SOLR-1865 Project: Solr Issue Type: Improvement Reporter: Robert Muir Priority: Minor Fix For: 3.1 Attachments: SOLR-1865.patch If you create say a stopwords list with windows notepad or other editors and save as UTF-8, some of these editors will insert a byte-order marker (zero-width no-break space) as the first character of the file. http://www.lucidimagination.com/search/document/5101871231fc95af/is_this_a_bug_of_the_ressourceloader -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
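The fix can be sketched with stdlib Java alone. The helper below is hypothetical and not the actual SolrResourceLoader.getLines() code, but it shows the relevant check: drop a leading U+FEFF (what a UTF-8 BOM decodes to) from the first line of the file:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class BomAwareReader {
    // U+FEFF: zero-width no-break space, i.e. a decoded UTF-8 BOM.
    private static final char BOM = '\uFEFF';

    // Hypothetical getLines()-style helper: strips a BOM from the start of
    // the first line, so a Notepad-saved stopwords file behaves normally.
    static List<String> getLines(BufferedReader reader) throws IOException {
        List<String> lines = new ArrayList<>();
        String line;
        boolean first = true;
        while ((line = reader.readLine()) != null) {
            if (first && !line.isEmpty() && line.charAt(0) == BOM) {
                line = line.substring(1);  // drop the BOM, keep the word
            }
            first = false;
            lines.add(line);
        }
        return lines;
    }
}
```

Without the check, the first stopword in the file would silently become "\uFEFFa" and never match anything at analysis time.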
[jira] Commented: (SOLR-1860) improve stopwords list handling
[ https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853684#action_12853684 ]

Robert Muir commented on SOLR-1860:
-----------------------------------

bq. Either we can setup a simple export and conversion to the format Solr currently supports now, and if/when someone updates StopFilterFactory to support the new format, then we can stop converting when we export

Well, this isn't that big of a deal either way. In Lucene we have a helper class called WordListLoader that supports loading this format from an InputStream. One idea to consider: we could try merging some of what SolrResourceLoader does with this WordListLoader, then it's all tested and in one place. It appears there might be some duplication of effort here... e.g. how long till a Lucene user complains about UTF-8 BOM markers in their stoplists :) We can still use ant to keep the files in sync automatically from the Lucene copies.

> improve stopwords list handling
> -------------------------------
>
> Key: SOLR-1860
> URL: https://issues.apache.org/jira/browse/SOLR-1860
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis
> Affects Versions: 3.1
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
>
> Currently Solr makes it easy to use english stopwords for StopFilter or CommonGramsFilter. Recently in lucene, we added stopwords lists (mostly, but not all, from snowball) to all the language analyzers. So it would be nice if a user can easily specify that they want to use a french stopword list, and use it for StopFilter or CommonGrams. The ones from snowball are, however, formatted in a different manner than the others (although in Lucene we have parsers to deal with this). Additionally, we abstract this from Lucene users by adding a static getDefaultStopSet to all analyzers.
> There are two approaches; the first one I think I prefer the most, but I'm not sure it matters as long as we have good examples (maybe a foreign language example schema?)
> 1. The user would specify something like: <filter class="solr.StopFilterFactory" fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../> This would just grab the CharArraySet from the FrenchAnalyzer's getDefaultStopSet method; who cares where it comes from or how it's loaded.
> 2. We add support for snowball-formatted stopwords lists, and the user could specify something like: <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball" ... /> The disadvantage to this is they have to know where the list is, what format it's in, etc. For example: snowball doesn't provide Romanian or Turkish stopword lists to go along with their stemmers, so we had to add our own.
> Let me know what you guys think, and I will create a patch.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
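For option 2, the snowball stoplist format (several whitespace-separated words per line, with "|" starting a comment) that the thread says Lucene already has parsers for can be sketched like this. The parser below is a simplified illustration, not Lucene's actual loader, and the class name is invented:

```java
import java.util.ArrayList;
import java.util.List;

public class SnowballStopwords {
    // Simplified take on the snowball stoplist format: "|" starts a
    // comment, and one line may carry several whitespace-separated words.
    static List<String> parse(List<String> fileLines) {
        List<String> words = new ArrayList<>();
        for (String line : fileLines) {
            int bar = line.indexOf('|');
            if (bar >= 0) {
                line = line.substring(0, bar);   // strip the trailing comment
            }
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }
}
```

This is why a plain "one word per line" loader can't read the snowball files directly, and why a format="snowball" switch (or a conversion step at export time) is needed.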
[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
[ https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852811#action_12852811 ]

Robert Muir commented on SOLR-1852:
-----------------------------------

Committed the test to trunk: revision 930262.

> enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
> ------------------------------------------------------------------------------------------------
>
> Key: SOLR-1852
> URL: https://issues.apache.org/jira/browse/SOLR-1852
> Project: Solr
> Issue Type: Bug
> Affects Versions: 1.4
> Reporter: Peter Wolanin
> Assignee: Robert Muir
> Attachments: SOLR-1852.patch, SOLR-1852_testcase.patch
>
> Symptom: searching for a string like a domain name containing a '.', the Solr 1.4 analyzer tells me that I will get a match, but when I enter the search either in the client or directly in Solr, the search fails.
> test string: Identi.ca
> queries that fail: IdentiCa, Identi.ca, Identi-ca
> query that matches: Identi ca
> schema in use is: http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34content-type=text%2Fplainview=copathrev=DRUPAL-6--1
> Screen shots:
> analysis: http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png
> dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png
> dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png
> standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png
> Whether or not the bug appears is determined by the surrounding text: the block "would be great to have support for Identi.ca on the follow" fails to match Identi.ca, but putting the content on its own or in another sentence, "Support Identi.ca", the search matches.
> Testing suggests the word "for" is the problem, and it looks like the bug occurs when a stop word precedes a word that is split up using the word delimiter filter. Setting enablePositionIncrements=false in the stop filter and reindexing causes the searches to match.
> According to Mark Miller in #solr, this bug appears to be fixed already in Solr trunk, either due to the upgraded lucene or changes to the WordDelimiterFactory.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1860) improve stopwords list handling
[ https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852978#action_12852978 ]

Robert Muir commented on SOLR-1860:
-----------------------------------

A third idea from Hoss Man: We should make it easy to edit these lists like english. So an idea is to create an intl/ folder or similar under the example, with stopwords_fr.txt, stopwords_de.txt, etc. Additionally we could have a schema-intl.xml with example types 'text_fr', 'text_de', etc. setup for various languages. I like this idea best.

> improve stopwords list handling
> -------------------------------
>
> Key: SOLR-1860
> URL: https://issues.apache.org/jira/browse/SOLR-1860
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis
> Affects Versions: 3.1
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
>
> Currently Solr makes it easy to use english stopwords for StopFilter or CommonGramsFilter. Recently in lucene, we added stopwords lists (mostly, but not all, from snowball) to all the language analyzers. So it would be nice if a user can easily specify that they want to use a french stopword list, and use it for StopFilter or CommonGrams. The ones from snowball are, however, formatted in a different manner than the others (although in Lucene we have parsers to deal with this). Additionally, we abstract this from Lucene users by adding a static getDefaultStopSet to all analyzers.
> There are two approaches; the first one I think I prefer the most, but I'm not sure it matters as long as we have good examples (maybe a foreign language example schema?)
> 1. The user would specify something like: <filter class="solr.StopFilterFactory" fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../> This would just grab the CharArraySet from the FrenchAnalyzer's getDefaultStopSet method; who cares where it comes from or how it's loaded.
> 2. We add support for snowball-formatted stopwords lists, and the user could specify something like: <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball" ... /> The disadvantage to this is they have to know where the list is, what format it's in, etc. For example: snowball doesn't provide Romanian or Turkish stopword lists to go along with their stemmers, so we had to add our own.
> Let me know what you guys think, and I will create a patch.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1859) speed up indexing for example schema
[ https://issues.apache.org/jira/browse/SOLR-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852375#action_12852375 ]

Robert Muir commented on SOLR-1859:
-----------------------------------

Any objections? If not I would like to commit later today. Thanks!

> speed up indexing for example schema
> -------------------------------------
>
> Key: SOLR-1859
> URL: https://issues.apache.org/jira/browse/SOLR-1859
> Project: Solr
> Issue Type: Task
> Components: Schema and Analysis
> Reporter: Robert Muir
> Assignee: Robert Muir
> Fix For: 3.1
> Attachments: SOLR-1859.patch
>
> The example schema should use the lucene core PorterStemmer (coded in Java by Martin Porter) instead of the Snowball one that is auto-generated code. Although we have sped up the Snowball stemmer, its still pretty slow and the example should be fast.
> Below is the output of ant test -Dtestcase=TestIndexingPerformance -Dargs=-server -Diter=10
> These results are consistent with large document indexing times that I have seen on large english collections with Lucene, we double indexing speed.
> {noformat}
> solr1.5branch:
> iter=10 time=5841 throughput=17120
> iter=10 time=5839 throughput=17126
> iter=10 time=6017 throughput=16619
>
> trunk (unpatched):
> iter=10 time=4132 throughput=24201
> iter=10 time=4142 throughput=24142
> iter=10 time=4151 throughput=24090
>
> trunk (patched):
> iter=10 time=2998 throughput=33355
> iter=10 time=3021 throughput=33101
> iter=10 time=3006 throughput=33266
> {noformat}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (SOLR-1859) speed up indexing for example schema
[ https://issues.apache.org/jira/browse/SOLR-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1859. --- Resolution: Fixed Committed revision 930050. speed up indexing for example schema Key: SOLR-1859 URL: https://issues.apache.org/jira/browse/SOLR-1859 Project: Solr Issue Type: Task Components: Schema and Analysis Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: SOLR-1859.patch The example schema should use the lucene core PorterStemmer (coded in Java by Martin Porter) instead of the Snowball one that is auto-generated code. Although we have sped up the Snowball stemmer, its still pretty slow and the example should be fast. Below is the output of ant test -Dtestcase=TestIndexingPerformance -Dargs=-server -Diter=10 These results are consistent with large document indexing times that I have seen on large english collections with Lucene, we double indexing speed. {noformat} solr1.5branch: iter=10 time=5841 throughput=17120 iter=10 time=5839 throughput=17126 iter=10 time=6017 throughput=16619 trunk (unpatched): iter=10 time=4132 throughput=24201 iter=10 time=4142 throughput=24142 iter=10 time=4151 throughput=24090 trunk (patched) iter=10 time=2998 throughput=33355 iter=10 time=3021 throughput=33101 iter=10 time=3006 throughput=33266 {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1860) improve stopwords list handling
improve stopwords list handling --- Key: SOLR-1860 URL: https://issues.apache.org/jira/browse/SOLR-1860 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Currently Solr makes it easy to use english stopwords for StopFilter or CommonGramsFilter. Recently in lucene, we added stopwords lists (mostly, but not all from snowball) to all the language analyzers. So it would be nice if a user can easily specify that they want to use a french stopword list, and use it for StopFilter or CommonGrams. The ones from snowball, are however formatted in a different manner than the others (although in Lucene we have parsers to deal with this). Additionally, we abstract this from Lucene users by adding a static getDefaultStopSet to all analyzers. There are two approaches, the first one I think I prefer the most, but I'm not sure it matters as long as we have good examples (maybe a foreign language example schema?) 1. The user would specify something like: filter class=solr.StopFilterFactory fromAnalyzer=org.apache.lucene.analysis.FrenchAnalyzer .../ This would just grab the CharArraySet from the FrenchAnalyzer's getDefaultStopSet method, who cares where it comes from or how its loaded. 2. We add support for snowball-formatted stopwords lists, and the user could something like: filter class=solr.StopFilterFactory words=org/apache/lucene/analysis/snowball/french_stop.txt format=snowball ... / The disadvantage to this is they have to know where the list is, what format its in, etc. For example: snowball doesn't provide Romanian or Turkish stopword lists to go along with their stemmers, so we had to add our own. Let me know what you guys think, and I will create a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (SOLR-1740) ShingleFilterFactory improvements
[ https://issues.apache.org/jira/browse/SOLR-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned SOLR-1740: - Assignee: Robert Muir ShingleFilterFactory improvements - Key: SOLR-1740 URL: https://issues.apache.org/jira/browse/SOLR-1740 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 1.5 Reporter: Steven Rowe Assignee: Robert Muir Priority: Minor Attachments: SOLR-1740.patch ShingleFilterFactory should allow specification of minimum shingle size (in addition to maximum shingle size), as well as the separator to use between tokens. These are implemented at LUCENE-2218. The attached patch allows ShingleFilterFactory to accept configuration of these items, and includes tests against the new functionality in TestShingleFilterFactory. Solr will have to upgrade to lucene-analyzers-3.1-dev.jar before the attached patch will apply. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1740) ShingleFilterFactory improvements
[ https://issues.apache.org/jira/browse/SOLR-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852686#action_12852686 ] Robert Muir commented on SOLR-1740: --- Now that we are on Lucene 3.1, it seems like it would be useful to add these new capabilities to the factory? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
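The semantics of the two new knobs are easy to state outside Lucene. A sketch (plain Python, not the ShingleFilter implementation, and ignoring its unigram-output option) of emitting all shingles between a minimum and maximum size with a configurable token separator:

```python
def shingles(tokens, min_size=2, max_size=2, sep="_"):
    """Emit every contiguous run of n tokens, min_size <= n <= max_size,
    joining the tokens of each shingle with `sep`."""
    out = []
    for i in range(len(tokens)):
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out

print(shingles(["please", "divide", "this"], min_size=2, max_size=3, sep="_"))
# → ['please_divide', 'please_divide_this', 'divide_this']
```

Before this issue, only the maximum size and the fixed `_` separator were configurable through the factory; the patch exposes the other two parameters.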
[jira] Commented: (SOLR-1312) BufferedTokenStream should use new Lucene 2.9 TokenStream API
[ https://issues.apache.org/jira/browse/SOLR-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852687#action_12852687 ] Robert Muir commented on SOLR-1312: --- Hello, I recommend we cancel this issue. No Solr tokenstreams extend this BufferedTokenStream API anymore, as it is bound to Token and does not support reuse. Currently this class is marked deprecated in trunk, with a backwards compatibility layer. If we think that an API like this is useful, we should make a new BufferedTokenStream-like API that uses AttributeSource instead of Token, but this API would not support reuse and would not be very performant, as it would have to use cloneAttributes() and copyTo() instead of captureState() and restoreState() BufferedTokenStream should use new Lucene 2.9 TokenStream API - Key: SOLR-1312 URL: https://issues.apache.org/jira/browse/SOLR-1312 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 1.4 Reporter: Tom Burton-West Priority: Minor Since Solr 1.4 will be using Lucene 2.9, the Solr TokenFilters should probably be updated to use the Lucene 2.9 TokenStream API. This issue is to put BufferedTokenStream on the list of Filters that need updating. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1740) ShingleFilterFactory improvements
[ https://issues.apache.org/jira/browse/SOLR-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1740: -- Attachment: SOLR-1740.patch Steven's patch, synced to trunk. I plan to commit shortly, thanks for the configuration tests Steven. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1740) ShingleFilterFactory improvements
[ https://issues.apache.org/jira/browse/SOLR-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1740: -- Affects Version/s: (was: 1.5) 3.1 Fix Version/s: 3.1 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (SOLR-1740) ShingleFilterFactory improvements
[ https://issues.apache.org/jira/browse/SOLR-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1740. --- Resolution: Fixed Committed revision 930163. Thanks Steven! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1857) cleanup and sync analysis with lucene trunk
cleanup and sync analysis with lucene trunk --- Key: SOLR-1857 URL: https://issues.apache.org/jira/browse/SOLR-1857 Project: Solr Issue Type: Task Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Fix For: 3.1 Solr works on the lucene trunk, but uses a lot of deprecated APIs. Additionally two factories are missing, the Keyword and StemmerOverride filters. The code can be improved with 3.x's generics support, removing casts, etc. Finally there is some code duplication with lucene, and some cleanup (such as deprecating factories for stuff thats deprecated in trunk). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1857) cleanup and sync analysis with lucene trunk
[ https://issues.apache.org/jira/browse/SOLR-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1857: -- Attachment: SOLR-1857.patch attached is a regrettably large patch to sync us up, and clean things up a bit. this removes all use of deprecated lucene APIs, except via things that are now deprecated in Solr itself. All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1857) cleanup and sync analysis with lucene trunk
[ https://issues.apache.org/jira/browse/SOLR-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852079#action_12852079 ] Robert Muir commented on SOLR-1857: --- if no one objects, I would like to commit in a day or two. If anyone wants to review, thats great... i know its large... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (SOLR-1857) cleanup and sync analysis with lucene trunk
[ https://issues.apache.org/jira/browse/SOLR-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned SOLR-1857: - Assignee: Robert Muir -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1857) cleanup and sync analysis with lucene trunk
[ https://issues.apache.org/jira/browse/SOLR-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852213#action_12852213 ] Robert Muir commented on SOLR-1857: --- bq. I just did a 5 min review, not line-by-line, but seems fine in general. Thanks for the review Yonik, I'll move forward then and commit soon... I'll open an issue next for the default schema speedups... looking forward to this :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
[ https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned SOLR-1852: - Assignee: Robert Muir enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries - Key: SOLR-1852 URL: https://issues.apache.org/jira/browse/SOLR-1852 Project: Solr Issue Type: Bug Affects Versions: 1.4 Reporter: Peter Wolanin Assignee: Robert Muir Attachments: SOLR-1852.patch, SOLR-1852_testcase.patch Symptom: searching for a string like a domain name containing a '.', the Solr 1.4 analyzer tells me that I will get a match, but when I enter the search either in the client or directly in Solr, the search fails. test string: Identi.ca queries that fail: IdentiCa, Identi.ca, Identi-ca query that matches: Identi ca schema in use is: http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34&content-type=text%2Fplain&view=co&pathrev=DRUPAL-6--1 Screen shots: analysis: http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png Whether or not the bug appears is determined by the surrounding text: "would be great to have support for Identi.ca" on the follow block fails to match Identi.ca, but putting the content on its own or in another sentence, "Support Identi.ca", the search matches. Testing suggests the word "for" is the problem, and it looks like the bug occurs when a stop word precedes a word that is split up using the word delimiter filter. Setting enablePositionIncrements=false in the stop filter and reindexing causes the searches to match. 
According to Mark Miller in #solr, this bug appears to be fixed already in Solr trunk, either due to the upgraded lucene or changes to the WordDelimiterFactory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
[ https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852216#action_12852216 ] Robert Muir commented on SOLR-1852: --- I'm afraid of WDF, but I don't think I am the only one, and I think it would be good to fix this bug. If no one objects, I'd like to commit these patches (testcase and backport the trunk filter) to the 1.5 branch in a few days. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
[ https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852234#action_12852234 ] Robert Muir commented on SOLR-1852: --- Peter it is... but admittedly it has not been in trunk for very long, and WDF is pretty complex. It's a bit scary to backport a rewrite of it for this reason, but at the same time, we've got this bug and the other config bugs found in SOLR-1706, so I think its the right thing to do... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1859) speed up indexing for example schema
speed up indexing for example schema Key: SOLR-1859 URL: https://issues.apache.org/jira/browse/SOLR-1859 Project: Solr Issue Type: Task Components: Schema and Analysis Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 The example schema should use the lucene core PorterStemmer (coded in Java by Martin Porter) instead of the Snowball one that is auto-generated code. Although we have sped up the Snowball stemmer, its still pretty slow and the example should be fast. Below is the output of ant test -Dtestcase=TestIndexingPerformance -Dargs=-server -Diter=10 These results are consistent with large document indexing times that I have seen on large english collections with Lucene, we double indexing speed. {noformat} solr1.5branch: iter=10 time=5841 throughput=17120 iter=10 time=5839 throughput=17126 iter=10 time=6017 throughput=16619 trunk (unpatched): iter=10 time=4132 throughput=24201 iter=10 time=4142 throughput=24142 iter=10 time=4151 throughput=24090 trunk (patched) iter=10 time=2998 throughput=33355 iter=10 time=3021 throughput=33101 iter=10 time=3006 throughput=33266 {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1859) speed up indexing for example schema
[ https://issues.apache.org/jira/browse/SOLR-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1859: -- Attachment: SOLR-1859.patch attached is a patch. I fixed every instance for general types like text in every schema file i could find, including test ones, and commented-out instances, too. All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
protwords.txt support in stemmers
Hello Solr devs, One thing we did recently in lucene that I would like to expose in Solr, is add support for protected words to all stemmers. So the way this works is that a TokenStream attribute 'KeywordAttribute' is set, and all the stemfilters know to ignore tokens with this boolean value set. We also added two neat tokenfilters that make this easy to use:
* KeywordMarkerFilter: a tokenfilter, that given a set of input words, marks them as keywords with this attribute so any later stemmer ignores them.
* StemmerOverrideFilter: a tokenfilter, that given a map of input words->stems, stems them with the dictionary, and marks them as keywords so any later stemmer ignores them.
We have two choices:
* we could treat this stuff as impl details, and add protwords.txt support to all stemming factories. we could just wrap the filter with a keywordmarkerfilter internally.
* we could deprecate the explicit protwords.txt in the few factories that support it, and instead create a factory for KeywordMarkerFilter.
* we could do something else, e.g. both.
So, to illustrate, by adding a factory for the KeywordMarkerFilter, a user could do: <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> <filter class="solr.SomeStemmer"/> and get the same effect, instead of having to add support for protwords.txt to every single stem factory. I don't really have a personal preference as to how we do it, but it would be cool to have a plan so we can add these factories and clean a few things up. In any event I think we should add a factory for the StemmerOverrideFilter, so someone can have a text file with exceptions, the dutch handling for fiets comes to mind. Thanks -- Robert Muir rcm...@gmail.com
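The mechanism described above is simple to model. A toy sketch (Python; the boolean flag stands in for Lucene's KeywordAttribute, and the one-rule stemmer is invented purely for illustration) showing why a marker stage plus a flag-respecting stemmer gives protwords.txt behavior without touching every stem factory:

```python
def keyword_marker(tokens, protected):
    """Mark tokens found in `protected` as keywords.
    Stand-in for Lucene's KeywordMarkerFilter setting KeywordAttribute."""
    return [(tok, kw or tok in protected) for tok, kw in tokens]

def toy_stemmer(tokens):
    """A deliberately naive 's'-stripping stemmer that, like the real
    stem filters, leaves keyword-flagged tokens untouched."""
    out = []
    for tok, kw in tokens:
        if not kw and tok.endswith("s"):
            tok = tok[:-1]
        out.append((tok, kw))
    return out

stream = [(t, False) for t in ["fiets", "cats"]]
stream = keyword_marker(stream, protected={"fiets"})
print(toy_stemmer(stream))
# → [('fiets', True), ('cat', False)]
```

Here "fiets" (the Dutch example from the email) survives stemming because the marker ran first; any stemmer placed later in the chain gets the same protection for free.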
Re: protwords.txt support in stemmers
On Tue, Mar 30, 2010 at 8:33 AM, Yonik Seeley yo...@lucidimagination.com wrote: It would also be nice to make the token categories generated by tokenizers into tags (like StandardTokenizer's ACRONYM, etc). A tokenizer that detected many of the properties could significantly speed up analysis because tokens would not have to be re-analyzed to see if they contain mixed case, numbers, hyphens, etc (i.e. the fast path for WDF would be checking a bit per token). I like this idea, but it does seem a little bit dangerous. e.g. the tokenizer could set one of these values, but if some tokenfilter down the stream doesn't properly use it, you could introduce bugs (by assuming a word has no numbers when in fact it now does, due to say, a PatternReplaceFilter). So I think we would simply end up adding a lot of these redundant checks back, e.g. you would have to re-analyze the term after any regex replacement from PatternReplaceFilter to properly set these flags... and it might introduce a lot of subtle bugs. -- Robert Muir rcm...@gmail.com
Re: protwords.txt support in stemmers
On Tue, Mar 30, 2010 at 8:33 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir rcm...@gmail.com wrote: We have two choices:
* we could treat this stuff as impl details, and add protwords.txt support to all stemming factories. we could just wrap the filter with a keywordmarkerfilter internally.
* we could deprecate the explicit protwords.txt in the few factories that support it, and instead create a factory for KeywordMarkerFilter.
* we could do something else, e.g. both.
So, to illustrate, by adding a factory for the KeywordMarkerFilter, a user could do: <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> <filter class="solr.SomeStemmer"/> and get the same effect, instead of having to add support for protwords.txt to every single stem factory. Yep, this decomposition seems more powerful. Sort of related: for a long time I've had the idea of allowing the expression of more complex filter chains that can conditionally execute some parts based on tags set by other parts. This is straightforward to just hand-code in Java of course, but trickier to do well in a declarative setting: <filter class="solr.Tagger" tag="protect" words="protwords.txt"/> <filter class="solr.SomeStemmer" skipTags="protect"/> The idea was to also make this fast by allocating a bit per tag (assuming we somehow knew all of the possible ones in a particular filter chain) and using a bitfield (long) to set and test. I was planning on using Token.flags before the new analysis attribute stuff came into being. It would also be nice to make the token categories generated by tokenizers into tags (like StandardTokenizer's ACRONYM, etc). A tokenizer that detected many of the properties could significantly speed up analysis because tokens would not have to be re-analyzed to see if they contain mixed case, numbers, hyphens, etc (i.e. the fast path for WDF would be checking a bit per token). 
Anyway, probably something for another day, but I wanted to throw it out there. -Yonik http://www.lucidimagination.com Sorta unrelated too, but on the same topic of performance, I'd really like to improve the indexing speed with the example schema, and thats my hidden motivation here. I think we've already significantly improved WDF and SnowballPorter performance in trunk, but if we add this support we could at least consider switching to the much much faster PorterStemmer in the Lucene core for the example schema, as it would then support protected words via this mechanism. Do you have a preferred way to benchmark type text for example? Ideally in the future the lucene benchmark package could support benchmarking Solr schema definitions... but we aren't there yet! -- Robert Muir rcm...@gmail.com
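The bit-per-tag idea Yonik describes reduces to ordinary integer bit operations. A small sketch (names invented for illustration), assuming the set of tag names in a chain is known up front so each can be assigned one bit of the flags word:

```python
# One bit per tag name, as in the proposed bitfield-in-a-long design.
TAGS = {name: 1 << i for i, name in enumerate(["protect", "acronym", "has_digit"])}

def set_tag(flags, name):
    """Return flags with the named tag's bit turned on."""
    return flags | TAGS[name]

def has_tag(flags, name):
    """Test the named tag's bit -- the 'fast path' check per token."""
    return bool(flags & TAGS[name])

flags = set_tag(0, "protect")
print(has_tag(flags, "protect"), has_tag(flags, "acronym"))
# → True False
```

The appeal is that a downstream filter's skip decision costs a single AND and compare per token, rather than re-scanning the term's characters.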
Re: protwords.txt support in stemmers
On Tue, Mar 30, 2010 at 10:32 AM, Yonik Seeley yo...@lucidimagination.com wrote: Unfortunately not... it's normally something ad hoc like uploading a big CSV file, etc. There's also the very simplistic TestIndexingPerformance. ant test -Dtestcase=TestIndexingPerformance -Dargs=-server -Diter=10; grep throughput build/test-results/*TestIndexingPerformance* Cool, as a quick stab at this, I ran this 3 times on solr 1.5, solr trunk, and solr trunk with the proposed mod. The results are consistent with what I have seen indexing large docs with just lucene, too.
solr1.5branch: iter=10 time=5841 throughput=17120 iter=10 time=5839 throughput=17126 iter=10 time=6017 throughput=16619
trunk: iter=10 time=4132 throughput=24201 iter=10 time=4142 throughput=24142 iter=10 time=4151 throughput=24090
trunk, swap Snowball Porter with Core Lucene Porter: iter=10 time=2978 throughput=33579 iter=10 time=2973 throughput=33636 iter=10 time=2925 throughput=34188
-- Robert Muir rcm...@gmail.com
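Averaging the three runs per configuration (numbers taken from the message above) confirms the rough doubling over the 1.5 branch:

```python
# throughput figures from the TestIndexingPerformance runs above
runs = {
    "solr_1_5": [17120, 17126, 16619],
    "trunk": [24201, 24142, 24090],
    "trunk_core_porter": [33579, 33636, 34188],
}
avg = {name: sum(v) / len(v) for name, v in runs.items()}
speedup = avg["trunk_core_porter"] / avg["solr_1_5"]
print({name: round(a) for name, a in avg.items()})
print(round(speedup, 2))  # ~2x over the 1.5 branch
```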
[jira] Updated: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
[ https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1852: -- Attachment: SOLR-1852_testcase.patch attached is a testcase demonstrating the bug. The problem is that if you have, for example, "the lucene.solr", where "the" is a stopword, the Solr 1.4 WordDelimiter bumps the position increment of *both* the lucene and solr tokens:
* lucene (posInc=2)
* solr (posInc=2)
* lucenesolr (posInc=0)
Instead it should look like:
* lucene (posInc=2)
* solr (posInc=1)
* lucenesolr (posInc=0)
In my opinion the behavior of trunk is correct, and this is a bug. But I don't know how to fix just Solr 1.4's WDF in a better way than dropping in the entire rewritten WDF... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
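The consequence for phrase queries is clearest after converting the increments back to absolute positions. A sketch (plain Python, standing in for the token streams described in the comment above):

```python
def positions(stream):
    """Convert (term, positionIncrement) pairs to (term, absolute position)."""
    pos, out = 0, []
    for term, inc in stream:
        pos += inc
        out.append((term, pos))
    return out

# Indexed text "the lucene.solr", with "the" removed by the stop filter:
buggy = positions([("lucene", 2), ("solr", 2), ("lucenesolr", 0)])
correct = positions([("lucene", 2), ("solr", 1), ("lucenesolr", 0)])
print(buggy)    # solr lands at position 4: a gap after lucene@2, so the
                # phrase query "lucene solr" (expecting adjacency) misses
print(correct)  # solr at position 3, adjacent to lucene@2: phrase matches
```

This is why the symptom only appears when a stopword precedes the delimited word: without the stopword there is no extra increment to mis-apply.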
[jira] Resolved: (SOLR-1710) convert worddelimiterfilter to new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1710. --- Resolution: Fixed Fix Version/s: 3.1 Assignee: Mark Miller This was resolved in revision 922957. convert worddelimiterfilter to new tokenstream API -- Key: SOLR-1710 URL: https://issues.apache.org/jira/browse/SOLR-1710 Project: Solr Issue Type: Improvement Components: Schema and Analysis Reporter: Robert Muir Assignee: Mark Miller Fix For: 3.1 Attachments: SOLR-1710-readable.patch, SOLR-1710-readable.patch, SOLR-1710.patch, SOLR-1710.patch This one was a doozy, attached is a patch to convert it to the new tokenstream API. Some of the logic was split into WordDelimiterIterator (exposes a BreakIterator-like api for iterating subwords) the filter is much more efficient now, no cloning. before applying the patch, copy the existing WordDelimiterFilter to OriginalWordDelimiterFilter the patch includes a testcase (TestWordDelimiterBWComp) which generates random strings from various subword combinations. For each random string, it compares output against the existing WordDelimiterFilter for all 512 combinations of boolean parameters. NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these combinations. The bugs discovered in SOLR-1706 are fixed here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (SOLR-1657) convert the rest of solr to use the new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1657. --- Resolution: Fixed Fix Version/s: 3.1 Assignee: Mark Miller This was resolved in revision 922957. convert the rest of solr to use the new tokenstream API --- Key: SOLR-1657 URL: https://issues.apache.org/jira/browse/SOLR-1657 Project: Solr Issue Type: Task Reporter: Robert Muir Assignee: Mark Miller Fix For: 3.1 Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657_part2.patch, SOLR-1657_synonyms_ugly_slightly_less_slow.patch, SOLR-1657_synonyms_ugly_slow.patch org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- -org.apache.solr.handler:- -AnalysisRequestHandler- -AnalysisRequestHandlerBase- -org.apache.solr.handler.component:- -QueryElevationComponent- -SpellCheckComponent- -org.apache.solr.highlight:- -DefaultSolrHighlighter- -org.apache.solr.spelling:- -SpellingQueryConverter- -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (SOLR-1706) wrong tokens output from WordDelimiterFilter depending upon options
[ https://issues.apache.org/jira/browse/SOLR-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1706. --- Resolution: Fixed Fix Version/s: 3.1 Assignee: Mark Miller This was resolved in revision 922957. wrong tokens output from WordDelimiterFilter depending upon options --- Key: SOLR-1706 URL: https://issues.apache.org/jira/browse/SOLR-1706 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 1.4 Reporter: Robert Muir Assignee: Mark Miller Fix For: 3.1 Below you can see that when I have requested to only output numeric concatenations (not words), some words are still sometimes output, ignoring the options I have provided, and even then, in a very inconsistent way.
{code}
assertWdf("Super-Duper-XL500-42-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
  new String[] { "42", "AutoCoder" },
  new int[] { 18, 21 },
  new int[] { 20, 30 },
  new int[] { 1, 1 });
assertWdf("Super-Duper-XL500-42-AutoCoder's-56", 0,0,0,1,0,0,0,0,1, null,
  new String[] { "42", "AutoCoder", "56" },
  new int[] { 18, 21, 33 },
  new int[] { 20, 30, 35 },
  new int[] { 1, 1, 1 });
assertWdf("Super-Duper-XL500-AB-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
  new String[] { },
  new int[] { },
  new int[] { },
  new int[] { });
assertWdf("Super-Duper-XL500-42-AutoCoder's-BC", 0,0,0,1,0,0,0,0,1, null,
  new String[] { "42" },
  new int[] { 18 },
  new int[] { 20 },
  new int[] { 1 });
{code}
where assertWdf is
{code}
void assertWdf(String text, int generateWordParts, int generateNumberParts,
    int catenateWords, int catenateNumbers, int catenateAll,
    int splitOnCaseChange, int preserveOriginal, int splitOnNumerics,
    int stemEnglishPossessive, CharArraySet protWords, String expected[],
    int startOffsets[], int endOffsets[], String types[], int posIncs[])
    throws IOException {
  TokenStream ts = new WhitespaceTokenizer(new StringReader(text));
  WordDelimiterFilter wdf = new WordDelimiterFilter(ts, generateWordParts,
      generateNumberParts, catenateWords, catenateNumbers, catenateAll,
      splitOnCaseChange, preserveOriginal, splitOnNumerics,
      stemEnglishPossessive, protWords);
  assertTokenStreamContents(wdf, expected, startOffsets, endOffsets, types, posIncs);
}
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (SOLR-1820) Remove custom greek/russian charsets encoding
[ https://issues.apache.org/jira/browse/SOLR-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1820. --- Resolution: Fixed Fix Version/s: 3.1 Assignee: Robert Muir This was resolved in revision 922964. Remove custom greek/russian charsets encoding - Key: SOLR-1820 URL: https://issues.apache.org/jira/browse/SOLR-1820 Project: Solr Issue Type: Task Components: Schema and Analysis Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: 3.1 Attachments: SOLR-1820.patch In Solr 1.4, we deprecated support for 'custom encodings embedded inside unicode'. This is where the analyzer in Lucene itself did encoding conversions; it's better to just let analyzers be analyzers, and leave encoding conversion to Java. In order to move to Lucene 3.x, we need to remove this deprecated support, and instead issue an error in the factories if you try to do this (instead of a warning). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
[ https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850612#action_12850612 ] Robert Muir commented on SOLR-1852: --- bq. The changes in the patch originate at SOLR-1706 and SOLR-1657, however I don't think it's actually the same bug as SOLR-1706 intended to fix, since in the admin analyzer interface the generated tokens look correct. Yeah, I don't like the situation at all, as it's not obvious to me at a glance how the trunk impl fixes your problem, but at the same time how this changed behavior slipped past the random tests on SOLR-1710. enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries - Key: SOLR-1852 URL: https://issues.apache.org/jira/browse/SOLR-1852 Project: Solr Issue Type: Bug Affects Versions: 1.4 Reporter: Peter Wolanin Attachments: SOLR-1852.patch Symptom: searching for a string like a domain name containing a '.', the Solr 1.4 analyzer tells me that I will get a match, but when I enter the search either in the client or directly in Solr, the search fails. test string: Identi.ca queries that fail: "IdentiCa", "Identi.ca", "Identi-ca" query that matches: "Identi ca" schema in use is: http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34&content-type=text%2Fplain&view=co&pathrev=DRUPAL-6--1 Screen shots: analysis: http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png Whether or not the bug appears is determined by the surrounding text: the block "would be great to have support for Identi.ca" fails to match Identi.ca, but putting the content on its own or in another sentence, "Support Identi.ca", the search matches. 
Testing suggests the word "for" is the problem, and it looks like the bug occurs when a stop word precedes a word that is split up using the word delimiter filter. Setting enablePositionIncrements=false in the stop filter and reindexing causes the searches to match. According to Mark Miller in #solr, this bug appears to be fixed already in Solr trunk, either due to the upgraded lucene or changes to the WordDelimiterFactory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
[ https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850613#action_12850613 ] Robert Muir commented on SOLR-1852: --- OK, so your bug relates somehow to how the accumulated position increment gap is handled. This is how your stopword fits into the situation: somehow the new code is handling it better for your case, but perhaps it's wrong. There are quite a few tests in TestWordDelimiter, which it passes, but I'll spend some time tonight verifying its correctness before we declare success... enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries - Key: SOLR-1852 URL: https://issues.apache.org/jira/browse/SOLR-1852 Project: Solr Issue Type: Bug Affects Versions: 1.4 Reporter: Peter Wolanin Attachments: SOLR-1852.patch Symptom: searching for a string like a domain name containing a '.', the Solr 1.4 analyzer tells me that I will get a match, but when I enter the search either in the client or directly in Solr, the search fails. test string: Identi.ca queries that fail: "IdentiCa", "Identi.ca", "Identi-ca" query that matches: "Identi ca" schema in use is: http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34&content-type=text%2Fplain&view=co&pathrev=DRUPAL-6--1 Screen shots: analysis: http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png Whether or not the bug appears is determined by the surrounding text: the block "would be great to have support for Identi.ca" fails to match Identi.ca, but putting the content on its own or in another sentence, "Support Identi.ca", the search matches. 
Testing suggests the word "for" is the problem, and it looks like the bug occurs when a stop word precedes a word that is split up using the word delimiter filter. Setting enablePositionIncrements=false in the stop filter and reindexing causes the searches to match. According to Mark Miller in #solr, this bug appears to be fixed already in Solr trunk, either due to the upgraded lucene or changes to the WordDelimiterFactory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: svn commit: r928069 - in /lucene/dev/trunk: lucene/ lucene/backwards/src/test/org/apache/lucene/util/ lucene/contrib/benchmark/src/test/org/apache/lucene/benchmark/ lucene/contrib/benchmark/src/
();
+      } catch (LockReleaseFailedException e) {
+        // well lets pretend its released anyway
+      }
+    }
     } catch (IOException e) {
       throw new RuntimeException("unable to write results", e);
     } finally {
@@ -227,3 +254,4 @@ public class SolrJUnitResultFormatter im
     sb.append(StringUtils.LINE_SEP);
   }
 }
+

Modified: lucene/dev/trunk/solr/build.xml
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/solr/build.xml?rev=928069&r1=928068&r2=928069&view=diff
==============================================================================
--- lucene/dev/trunk/solr/build.xml (original)
+++ lucene/dev/trunk/solr/build.xml Fri Mar 26 21:55:57 2010
@@ -349,6 +349,7 @@
       <pathelement location="${dest}/tests"/>
       <!-- include the solrj classpath and jetty files included in example -->
       <path refid="compile.classpath.solrj" />
+      <pathelement location="${common-solr.dir}/../lucene/build/classes/test" />
       <!-- include some lucene test code -->
       <pathelement path="${java.class.path}"/>
     </path>

Modified: lucene/dev/trunk/solr/common-build.xml
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/solr/common-build.xml?rev=928069&r1=928068&r2=928069&view=diff
==============================================================================
--- lucene/dev/trunk/solr/common-build.xml (original)
+++ lucene/dev/trunk/solr/common-build.xml Fri Mar 26 21:55:57 2010
@@ -103,7 +103,7 @@
   <property name="junit.output.dir" location="${common-solr.dir}/${dest}/test-results"/>
   <property name="junit.reports" location="${common-solr.dir}/${dest}/test-results/reports"/>
   <property name="junit.formatter" value="plain"/>
-  <property name="junit.details.formatter" value="org.apache.solr.SolrJUnitResultFormatter"/>
+  <property name="junit.details.formatter" value="org.apache.lucene.util.LuceneJUnitResultFormatter"/>

   <!-- Maven properties -->
   <property name="maven.build.dir" value="${basedir}/build/maven"/>
-- Robert Muir rcm...@gmail.com
build.xml and lucene test code
I noticed that for whatever reason, Solr's build.xml doesn't detect if Lucene's test code is out of date. (I am fooling around with LUCENE-1709, where we will try to do the same parallel test execution for Lucene as in Solr, and was moving the special formatter to Lucene when I noticed this.) Don't have any ideas how to fix it, but just wanted to mention it so it's not forgotten. Worst case, if/when we resolve LUCENE-1709, you will have to run ant clean first... but I am sure there is some better ant trickery to detect this situation, maybe just another task dependency. -- Robert Muir rcm...@gmail.com
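One possible bit of Ant trickery along those lines, as a sketch only (the target names and paths here are made up for illustration, not taken from Solr's actual build files): an <uptodate> check that forces a rebuild of Lucene's test classes when their sources are newer than the compiled output.

```xml
<!-- Hypothetical sketch: make Solr's test compile depend on Lucene's test
     classes being current, instead of silently running against stale ones.
     Directory names are illustrative. -->
<target name="check-lucene-tests">
  <uptodate property="lucene.tests.current">
    <srcfiles dir="../lucene/src/test" includes="**/*.java"/>
    <globmapper from="*.java" to="../../build/classes/test/*.class"/>
  </uptodate>
</target>

<target name="compile-lucene-tests" depends="check-lucene-tests"
        unless="lucene.tests.current">
  <!-- delegate to lucene's own build to recompile its tests -->
  <ant dir="../lucene" target="compile-test" inheritall="false"/>
</target>
```

Solr's test-compile target could then simply depend on compile-lucene-tests, so 'ant clean' would no longer be needed after changing Lucene test code.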
[jira] Commented: (SOLR-1835) speed up and improve tests
[ https://issues.apache.org/jira/browse/SOLR-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848586#action_12848586 ] Robert Muir commented on SOLR-1835: --- committed revision 926470 to newtrunk. if you have problems, please just revert and I will help debug them. for future speedups, we should try to move ant logic to common-build.xml and re-use it for contribs. this way, DIH tests etc will run in parallel, too. speed up and improve tests -- Key: SOLR-1835 URL: https://issues.apache.org/jira/browse/SOLR-1835 Project: Solr Issue Type: Improvement Reporter: Yonik Seeley Fix For: 3.1 Attachments: SOLR-1835.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch General test improvements. We should use @BeforeClass where possible to avoid per test method overhead, and reuse lucene test utils where possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1835) speed up and improve tests
[ https://issues.apache.org/jira/browse/SOLR-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1835: -- Attachment: SOLR-1835_parallel.patch attached is a patch to parallelize the tests... improvements can be done, and contrib too (e.g. DIH) but this drops my test time to 4:42 on the first try. speed up and improve tests -- Key: SOLR-1835 URL: https://issues.apache.org/jira/browse/SOLR-1835 Project: Solr Issue Type: Improvement Reporter: Yonik Seeley Fix For: 3.1 Attachments: SOLR-1835.patch, SOLR-1835_parallel.patch General test improvements. We should use @BeforeClass where possible to avoid per test method overhead, and reuse lucene test utils where possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1835) speed up and improve tests
[ https://issues.apache.org/jira/browse/SOLR-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1835: -- Attachment: SOLR-1835_parallel.patch updated patch: * doesn't do parallel for the -Dtestcase= case, but does for all, -Dtestpackage, -Dtestpackageroot, etc. * you can make the condition for whether to do parallel or not more complex, e.g. nightlies could go sequentially. speed up and improve tests -- Key: SOLR-1835 URL: https://issues.apache.org/jira/browse/SOLR-1835 Project: Solr Issue Type: Improvement Reporter: Yonik Seeley Fix For: 3.1 Attachments: SOLR-1835.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch General test improvements. We should use @BeforeClass where possible to avoid per test method overhead, and reuse lucene test utils where possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
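Such a condition could be wired up in Ant roughly like this (a hypothetical sketch; the property and target names are invented, not from the patch): sequential for single-testcase runs and nightlies, parallel otherwise.

```xml
<!-- Hypothetical sketch: decide sequential vs parallel test runs.
     Property names here are illustrative only. -->
<condition property="tests.sequential">
  <or>
    <!-- a single -Dtestcase= run gains nothing from forking in parallel -->
    <isset property="testcase"/>
    <!-- nightlies could go sequentially, e.g. -Dtests.nightly=true -->
    <istrue value="${tests.nightly}"/>
  </or>
</condition>

<target name="test-parallel" unless="tests.sequential">
  <!-- fork several JUnit batches at once; "junit-batch" is a made-up target -->
  <parallel threadsPerProcessor="1">
    <antcall target="junit-batch"/>
  </parallel>
</target>

<target name="test-sequential" if="tests.sequential">
  <antcall target="junit-batch"/>
</target>
```

The same condition could later grow more clauses (OS, CPU count, CI vs developer machine) without touching the targets themselves.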
[jira] Updated: (SOLR-1835) speed up and improve tests
[ https://issues.apache.org/jira/browse/SOLR-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1835: -- Attachment: SOLR-1835_parallel.patch attached is a new patch: * the output from multiple threads is no longer interleaved * you need to put ant.jar and ant-junit.jar in example/lib for this patch to work. These need to be Ant 1.7.1 (Lucene needs this version anyway, I think) speed up and improve tests -- Key: SOLR-1835 URL: https://issues.apache.org/jira/browse/SOLR-1835 Project: Solr Issue Type: Improvement Reporter: Yonik Seeley Fix For: 3.1 Attachments: SOLR-1835.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch General test improvements. We should use @BeforeClass where possible to avoid per test method overhead, and reuse lucene test utils where possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1835) speed up and improve tests
[ https://issues.apache.org/jira/browse/SOLR-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1835: -- Attachment: SOLR-1835_parallel.patch There was a stray slash in the previous version; this caused some people to mistakenly believe they have a faster computer than me. speed up and improve tests -- Key: SOLR-1835 URL: https://issues.apache.org/jira/browse/SOLR-1835 Project: Solr Issue Type: Improvement Reporter: Yonik Seeley Fix For: 3.1 Attachments: SOLR-1835.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch General test improvements. We should use @BeforeClass where possible to avoid per test method overhead, and reuse lucene test utils where possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: rough outline of where Solr's going
On Thu, Mar 18, 2010 at 11:33 AM, Michael McCandless luc...@mikemccandless.com wrote: On version numbering... my inclination would be to let Solr and Lucene use their own version numbers (don't sync them up). I know it'd simplify our lives to have the same version across the board, but these numbers are really for our users, telling them when big changes were made, back compat broken, etc. I think that trumps dev convenience. Be sure to consider the deprecations removal: it's not possible for Solr to move to Lucene's trunk without this. Here are two examples of necessary deprecation removals in the branch so that Solr can use Lucene's trunk: https://issues.apache.org/jira/browse/SOLR-1820 http://www.lucidimagination.com/search/document/f07da8e4d69f5bfe/removal_of_deprecated_htmlstrip_tokenizer_factories It seems to be the consensus that people want a major version number change when this is done. So this is an example where the version numbers of Solr really do relate to Lucene, if we want them to share the same trunk. -- Robert Muir rcm...@gmail.com
Re: rough outline of where Solr's going
On Thu, Mar 18, 2010 at 1:12 PM, Michael McCandless luc...@mikemccandless.com wrote: Ahh, OK. Meaning Solr will have to remove deprecated support, which means Solr's next released version would be a major release? Ie 2.0? It's more complex than this. Solr depends on some lucene contrib modules, which apparently have no backwards-compatibility policy. I don't think we want to have to suddenly treat all these contrib modules like core lucene with regards to backwards compat; some of them haven't reached that level of maturity yet. On the other hand, exposing contrib's functionality via Solr is a great way to get more real users and devs giving feedback and improvements to help them mature. But we need to work on how to handle some of this: I suppose spatial is the worst case (don't really know), where Solr has a dependency on a Lucene contrib specifically labelled as experimental. -- Robert Muir rcm...@gmail.com
Re: How do I contribute bug fixes
On Thu, Mar 18, 2010 at 6:49 PM, Sanjoy Ghosh san...@yahoo.com wrote: Hello, Can I submit bug fixes? If so, what is the procedure? Thanks, Sanjoy Hello, Please take a look at this link: http://wiki.apache.org/solr/HowToContribute -- Robert Muir rcm...@gmail.com
Re: lucene and solr trunk
On Wed, Mar 17, 2010 at 9:09 AM, Michael McCandless luc...@mikemccandless.com wrote: Git, Maven, Hg, etc., all sound great for the future, but let's focus now on the baby step (how to re-org svn), today, so we can land the Solr upgrade work now being done on a branch... I agree. Another thing anyone can do to help if they have a spare few minutes, is to review the technical work done in the branch and provide feedback. The big JIRA issue is located at https://issues.apache.org/jira/browse/SOLR-1659 and other issues are linked to it. -- Robert Muir rcm...@gmail.com
Re: lucene and solr trunk
On Wed, Mar 17, 2010 at 12:40 PM, Mark Miller markrmil...@gmail.com wrote: Okay, so this looks good to me (a few others seemed to like it - though Lucene-Dev was somehow dropped earlier) - lets try this out on the branch? (then we can get rid of that horrible branch name ;) ) Anyone on the current branch object to having to do a quick svn switch? +1 -- Robert Muir rcm...@gmail.com
Re: rough outline of where Solr's going
On Wed, Mar 17, 2010 at 8:15 PM, Chris Hostetter hossman_luc...@fucit.org wrote: My key point being: Version numbers should communicate the significance in change to the *user* of the product, and the users of Solr are different than the users of Lucene-Java, so even if the releases happen in lock step, that doesn't mean the version numbers should be in lock step. As you stated modules were important to think about for svn location, it would only make sense that they are important to think about for release numbering, too. So let's say we spin off a lucene-analyzers module: it should be 3.1, too, as it's already a module to some degree, and having a lucene-analyzers-1.0.jar would be downright misleading. So from this perspective of modules, with solr being a module alongside lucene, 3.1 makes a lot of sense, and it also makes sense to try to release things together if possible so that users aren't confused. -- Robert Muir rcm...@gmail.com
Re: lucene and solr trunk
On Tue, Mar 16, 2010 at 3:43 AM, Simon Willnauer simon.willna...@googlemail.com wrote: One more thing which I wonder about even more is that this whole merging happens so quickly for reasons I don't see right now. I don't want to keep anybody from making progress but it appears like a rush to me. By the way, of the serious changes we applied to the branch, most of them had been sitting in JIRA for over 3 months not doing much: SOLR-1659. If you follow the linked issues, you can see all the stuff that got put in the branch... the branch was helpful for me, as I could help Mark with the ton of little things, like TokenStreams embedded inside JSP files :) As it's just a branch, if you want to go look at those patches (especially anything I did) and provide technical feedback, that would be great! But I think it's a mistake to say things are rushed when the work has been done for months. -- Robert Muir rcm...@gmail.com
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845301#action_12845301 ] Robert Muir commented on SOLR-1804: --- I wonder if you guys have any insight why the results of this test may have changed from 16 to 15 between Lucene 3.0 and Lucene 3.1-dev: http://svn.apache.org/viewvc?view=revisionrevision=923048 It did not change between Lucene 2.9 and Lucene 3.0, so I'm concerned about why the results would change between 3.0 and 3.1-dev. One possible explanation would be if Carrot2 used Version.LUCENE_CURRENT somewhere in its code. Any ideas? Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845451#action_12845451 ] Robert Muir commented on SOLR-1804: --- Hi Stanislaw: Correct, I did not upgrade anything else, just lucene. I'm sorry it's not exactly related to this issue (although if we need to upgrade carrot2 to be compatible with Lucene 3.x, then that's OK). My concern is more that we did something in Lucene between 3.0 and now that caused the results to be different... though again this could be explained if somewhere in its code Carrot2 uses some Lucene analysis component, but doesn't hardwire Version to LUCENE_29. If all else fails I can try to seek out the svn rev # of Lucene that causes this change, by brute force binary search :) Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845455#action_12845455 ] Robert Muir commented on SOLR-1804: --- Grant, I am concerned about a possible BW break in Lucene trunk, that is all. I think it's strange that the 3.0 and 3.1 jars give different results. Can you tell me if the clusters are reasonable? Here is the output. {noformat} junit.framework.AssertionFailedError: number of clusters: [ {labels=[Data Mining Applications], docs=[5, 13, 25, 12, 27],clusters=[]}, {labels=[Databases],docs=[15, 21, 7, 17, 11],clusters=[]}, {labels=[Knowledge Discovery],docs=[6, 18, 15, 17, 10],clusters=[]}, {labels=[Statistical Data Mining],docs=[28, 24, 2, 14],clusters=[]}, {labels=[Data Mining Solutions],docs=[5, 22, 8],clusters=[]}, {labels=[Data Mining Techniques],docs=[12, 2, 14],clusters=[]}, {labels=[Known as Data Mining],docs=[23, 17, 19],clusters=[]}, {labels=[Text Mining],docs=[6, 9, 29],clusters=[]}, {labels=[Dedicated],docs=[10, 11],clusters=[]}, {labels=[Extraction of Hidden Predictive],docs=[3, 11],clusters=[]}, {labels=[Information from Large],docs=[3, 7],clusters=[]}, {labels=[Neural Networks],docs=[12, 1],clusters=[]}, {labels=[Open],docs=[15, 20],clusters=[]}, {labels=[Research],docs=[26, 8],clusters=[]}, {labels=[Other Topics],docs=[16],clusters=[]} ] expected:16 but was:15 {noformat} Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845474#action_12845474 ] Robert Muir commented on SOLR-1804: --- Thanks for the confirmation the clusters are ok. Well, this is embarrassing, it turns out it is a backwards break, though documented, and the culprit is yours truly. This is the reason it gets different results: {noformat} * LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default. This means that terms with a position increment gap of zero do not affect the norms calculation by default. (Robert Muir) {noformat} I'll change the test to expect 15 clusters with Lucene 3.1, thanks :) Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
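The effect of the LUCENE-2286 change can be sketched numerically (illustrative only; `NormDemo` is a made-up class, and the formula follows DefaultSimilarity-style 1/sqrt(length) length normalization): once overlap tokens, those with posInc=0, stop counting toward field length, the same indexed field gets a different norm, which is enough to shift scores and downstream results such as the clustering above.

```java
// Sketch of the documented behavior change: with discountOverlaps enabled
// (the Lucene 3.1 default), tokens with position increment 0 no longer
// count toward field length in the norm computation.
public class NormDemo {
    static float lengthNorm(int numTerms, int numOverlap, boolean discountOverlaps) {
        // field "length" used for normalization
        int len = discountOverlaps ? numTerms - numOverlap : numTerms;
        return (float) (1.0 / Math.sqrt(len));
    }

    public static void main(String[] args) {
        // a 10-token field with 2 overlap tokens (e.g. from WordDelimiterFilter)
        System.out.println(lengthNorm(10, 2, false)); // 3.0 default: 1/sqrt(10)
        System.out.println(lengthNorm(10, 2, true));  // 3.1 default: 1/sqrt(8)
    }
}
```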
removal of deprecated HtmlStrip*Tokenizer factories
Hello, Is there any concern with removing the deprecated HtmlStrip*Tokenizer factories? These can be done with CharFilter instead and they have some problems with lucene's trunk. If no one objects, I'd like to remove these in the branch. Otherwise, Uwe tells me there is some way to make them work if need be. Thanks! -- Robert Muir rcm...@gmail.com
Re: removal of deprecated HtmlStrip*Tokenizer factories
On Mon, Mar 15, 2010 at 5:30 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Is there a way we can fix LUCENE-2098 too? I think this is good to fix, yet removing the deprecations is unrelated to this slowdown. The deprecated functionality (HtmlStrip*Tokenizer) is implemented in terms of the slower CharFilter, so it's not any faster; getting rid of it won't slow anyone down. That being said, I think we should still try to improve the performance of this stuff, I agree. -- Robert Muir rcm...@gmail.com
Re: removal of deprecated HtmlStrip*Tokenizer factories
On Mon, Mar 15, 2010 at 7:18 PM, Chris Hostetter hossman_luc...@fucit.org wrote: In the case of these factories: can't we eliminate the Html*Tokenizers themselves, but make the *factories* return the necessary *Tokenizer wrapped in an HtmlStripCharFilter? The factories would not be able to reuse the Tokenizer if you did this, because when you call reset(Reader) on it, the Reader would not be wrapped. -- Robert Muir rcm...@gmail.com
Re: removal of deprecated HtmlStrip*Tokenizer factories
On Mon, Mar 15, 2010 at 7:25 PM, Chris Hostetter hossman_luc...@fucit.org wrote: Hmmm... I'm not sure I understand how any declared CharFilter/Tokenizer combo will be able to deal with this any better, but I'll take your word for it. You can see this behavior in SolrAnalyzer's reusableTokenStream method; it reuses the Tokenizer but wraps the readers with charStream() [overridden by TokenizerChain to wrap the Reader with your CharFilter chain]:

  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    // if (true) return tokenStream(fieldName, reader);
    TokenStreamInfo tsi = (TokenStreamInfo)getPreviousTokenStream();
    if (tsi != null) {
      tsi.getTokenizer().reset(charStream(reader));  // <-- right here

Kill it then, and we'll just have to start making a list in the Upgrading section of CHANGES.txt noting the recommended upgrade path for this (and many, many things to come, I imagine) cool, I'll add some additional verbiage to the CHANGES in the branch. -- Robert Muir rcm...@gmail.com
Re: lucene and solr trunk
On Mon, Mar 15, 2010 at 11:43 PM, Mark Miller markrmil...@gmail.com wrote: Solr moves to Lucene's trunk: /java/trunk, /java/trunk/sol +1. With the goal of merged dev, merged tests, this looks the best to me. Simple to do patches that span both, simple to setup Solr to use Lucene trunk rather than jars. Short paths. Simple. I like it. +1 -- Robert Muir rcm...@gmail.com
Re: lucene and solr trunk
On Tue, Mar 16, 2010 at 12:01 AM, Chris Hostetter hossman_luc...@fucit.org wrote: 4) should it be possible for people to check out Lucene-Java w/o checking out Solr? (i suspect a whole lot of people who only care about the core library are going to really adamantly not want to have to check out all of Solr just to work on the core) This wouldn't really be merged development now would it? When I run 'ant test' I want the Solr tests to run, too. If one breaks because of a change, I want to look at the source and know why. -- Robert Muir rcm...@gmail.com
Re: lucene and solr trunk
On Tue, Mar 16, 2010 at 12:39 AM, Chris Hostetter hossman_luc...@fucit.org wrote: And as a committer, you should be concerned about things like this ... that doesn't mean every user of Lucene-Java who wants to build from source or apply their own local patches is going to feel the same way. Yep, those users probably already hate our backwards tests and the contrib tests too. -- Robert Muir rcm...@gmail.com
[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1657: -- Attachment: SOLR-1657_synonyms_ugly_slow.patch A very very ugly, very slow, but simple and conservative conversion of SynonymFilter to the new TokenStream API. convert the rest of solr to use the new tokenstream API --- Key: SOLR-1657 URL: https://issues.apache.org/jira/browse/SOLR-1657 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657_part2.patch, SOLR-1657_synonyms_ugly_slow.patch org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- -org.apache.solr.handler:- -AnalysisRequestHandler- -AnalysisRequestHandlerBase- -org.apache.solr.handler.component:- -QueryElevationComponent- -SpellCheckComponent- -org.apache.solr.highlight:- -DefaultSolrHighlighter- -org.apache.solr.spelling:- -SpellingQueryConverter- -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1657: -- Attachment: SOLR-1657_synonyms_ugly_slightly_less_slow.patch attached is a less slow version of the above. it preserves the fast path from the previous code. convert the rest of solr to use the new tokenstream API --- Key: SOLR-1657 URL: https://issues.apache.org/jira/browse/SOLR-1657 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657_part2.patch, SOLR-1657_synonyms_ugly_slightly_less_slow.patch, SOLR-1657_synonyms_ugly_slow.patch org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- -org.apache.solr.handler:- -AnalysisRequestHandler- -AnalysisRequestHandlerBase- -org.apache.solr.handler.component:- -QueryElevationComponent- -SpellCheckComponent- -org.apache.solr.highlight:- -DefaultSolrHighlighter- -org.apache.solr.spelling:- -SpellingQueryConverter- -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1820) Remove custom greek/russian charsets encoding
Remove custom greek/russian charsets encoding - Key: SOLR-1820 URL: https://issues.apache.org/jira/browse/SOLR-1820 Project: Solr Issue Type: Task Components: Schema and Analysis Reporter: Robert Muir Priority: Minor In Solr 1.4, we deprecated support for 'custom encodings embedded inside unicode'. This is where the analyzer in lucene itself did encoding conversions, its better to just let analyzers be analyzers, and leave encoding conversion to Java. In order to move to Lucene 3.x, we need to remove this deprecated support, and instead issue an error in the factories if you try to do this (instead of a warning). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1820) Remove custom greek/russian charsets encoding
[ https://issues.apache.org/jira/browse/SOLR-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1820: -- Attachment: SOLR-1820.patch Attached is a patch that removes the deprecated bits. If you try to specify the charset param, instead of a warning you get an error. Remove custom greek/russian charsets encoding - Key: SOLR-1820 URL: https://issues.apache.org/jira/browse/SOLR-1820 Project: Solr Issue Type: Task Components: Schema and Analysis Reporter: Robert Muir Priority: Minor Attachments: SOLR-1820.patch In Solr 1.4, we deprecated support for 'custom encodings embedded inside unicode'. This is where the analyzer in lucene itself did encoding conversions; it's better to just let analyzers be analyzers, and leave encoding conversion to Java. In order to move to Lucene 3.x, we need to remove this deprecated support, and instead issue an error in the factories if you try to do this (instead of a warning). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
TestEvaluatorBag
Hey guys, I am seeing a test failure for TestEvaluatorBag... I wonder if you have any ideas; I thought it might be my locale, but I changed it and still hit it consistently. Thanks! -- Robert Muir rcm...@gmail.com
Re: TestEvaluatorBag
I think this is a platform/timezone-dependent problem, which is why switching my locale didn't work. The test started failing because today in the US we switched to Daylight Saving Time, and somehow the test only fails for people in those timezones. On Sun, Mar 14, 2010 at 4:46 PM, Robert Muir rcm...@gmail.com wrote: Hey guys, I am seeing a test failure for TestEvaluatorBag... I wonder if you guys have any ideas, thought it might be my locale, but I changed it and I still hit it consistently. Thanks! -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
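The DST diagnosis above can be demonstrated with only the JDK. The epoch instant below is a hypothetical one chosen near the date in the failing test, not taken from the test itself: the same absolute instant formats differently depending on the chosen timezone, so a test that formats "now" against the platform default zone is inherently non-deterministic. Pinning an explicit TimeZone is one common fix for this class of failure (whether the attached SOLR-1821 patch does exactly that is not shown in this thread).

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class TimeZoneDemo {

    // Format one absolute instant using an explicit zone rather than the
    // platform default -- relying on the default is what makes such tests flaky.
    public static String format(long epochMillis, String zoneId) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone(zoneId));
        return fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        // Hypothetical instant: 2010-03-12 22:15 UTC, two days before the
        // 2010 US switch to daylight saving time on March 14.
        long t = 1268432100000L;
        System.out.println(format(t, "UTC"));              // 2010-03-12 22:15
        System.out.println(format(t, "America/New_York")); // 2010-03-12 17:15 (still EST)
    }
}
```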
[jira] Commented: (SOLR-1821) Failing testGetDateFormatEvaluator in TestEvaluatorBag
[ https://issues.apache.org/jira/browse/SOLR-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845144#action_12845144 ] Robert Muir commented on SOLR-1821: --- Nice, fixes the issue. Can you commit this? It would help us in our current work to ensure we are not breaking tests. Failing testGetDateFormatEvaluator in TestEvaluatorBag -- Key: SOLR-1821 URL: https://issues.apache.org/jira/browse/SOLR-1821 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 1.5 Reporter: Chris Male Attachments: SOLR-1821.patch On some TimeZones (such as EDT currently), TestEvaluatorBag.testGetDateFormatEvaluator fails with the following error: {code:xml} org.junit.ComparisonFailure: Expected :2010-03-12 17:15 Actual :2010-03-12 18:15 at org.junit.Assert.assertEquals(Assert.java:96) at org.junit.Assert.assertEquals(Assert.java:116) at org.apache.solr.handler.dataimport.TestEvaluatorBag.testGetDateFormatEvaluator(TestEvaluatorBag.java:127) {code} This seems due to the reliance on the System ticks in order to create the Date to compare against. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (SOLR-1821) Failing testGetDateFormatEvaluator in TestEvaluatorBag
[ https://issues.apache.org/jira/browse/SOLR-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned SOLR-1821: - Assignee: Robert Muir Failing testGetDateFormatEvaluator in TestEvaluatorBag -- Key: SOLR-1821 URL: https://issues.apache.org/jira/browse/SOLR-1821 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 1.5 Reporter: Chris Male Assignee: Robert Muir Attachments: SOLR-1821.patch On some TimeZones (such as EDT currently), TestEvaluatorBag.testGetDateFormatEvaluator fails with the following error: {code:xml} org.junit.ComparisonFailure: Expected :2010-03-12 17:15 Actual :2010-03-12 18:15 at org.junit.Assert.assertEquals(Assert.java:96) at org.junit.Assert.assertEquals(Assert.java:116) at org.apache.solr.handler.dataimport.TestEvaluatorBag.testGetDateFormatEvaluator(TestEvaluatorBag.java:127) {code} This seems due to the reliance on the System ticks in order to create the Date to compare against. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (SOLR-1821) Failing testGetDateFormatEvaluator in TestEvaluatorBag
[ https://issues.apache.org/jira/browse/SOLR-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1821. --- Resolution: Fixed Fix Version/s: 1.5 Committed revision 922991. Thanks Chris! Failing testGetDateFormatEvaluator in TestEvaluatorBag -- Key: SOLR-1821 URL: https://issues.apache.org/jira/browse/SOLR-1821 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 1.5 Reporter: Chris Male Assignee: Robert Muir Fix For: 1.5 Attachments: SOLR-1821.patch On some TimeZones (such as EDT currently), TestEvaluatorBag.testGetDateFormatEvaluator fails with the following error: {code:xml} org.junit.ComparisonFailure: Expected :2010-03-12 17:15 Actual :2010-03-12 18:15 at org.junit.Assert.assertEquals(Assert.java:96) at org.junit.Assert.assertEquals(Assert.java:116) at org.apache.solr.handler.dataimport.TestEvaluatorBag.testGetDateFormatEvaluator(TestEvaluatorBag.java:127) {code} This seems due to the reliance on the System ticks in order to create the Date to compare against. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1657: -- Attachment: SOLR-1657_part2.patch Here's a separate patch (_part2.patch) for all the remaining tokenstreams. The only one remaining now is SynonymFilter. For several areas in this patch, I didn't properly change any APIs to fully support the new Attributes-based API, I just got them off deprecated methods, still working with Token, and left TODOs. I figure it would be better to hash this out later on separate issues, where we modify those APIs to really take advantage of an Attributes-based API. convert the rest of solr to use the new tokenstream API --- Key: SOLR-1657 URL: https://issues.apache.org/jira/browse/SOLR-1657 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657_part2.patch org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- org.apache.solr.handler: AnalysisRequestHandler AnalysisRequestHandlerBase org.apache.solr.handler.component: QueryElevationComponent SpellCheckComponent org.apache.solr.highlight: DefaultSolrHighlighter org.apache.solr.spelling: SpellingQueryConverter -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1657: -- Description: org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- -org.apache.solr.handler:- -AnalysisRequestHandler- -AnalysisRequestHandlerBase- -org.apache.solr.handler.component:- -QueryElevationComponent- -SpellCheckComponent- -org.apache.solr.highlight:- -DefaultSolrHighlighter- -org.apache.solr.spelling:- -SpellingQueryConverter- was: org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- org.apache.solr.handler: AnalysisRequestHandler AnalysisRequestHandlerBase org.apache.solr.handler.component: QueryElevationComponent SpellCheckComponent org.apache.solr.highlight: DefaultSolrHighlighter org.apache.solr.spelling: SpellingQueryConverter convert the rest of solr to use the new tokenstream API --- Key: SOLR-1657 URL: https://issues.apache.org/jira/browse/SOLR-1657 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657_part2.patch org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- -org.apache.solr.handler:- -AnalysisRequestHandler- -AnalysisRequestHandlerBase- -org.apache.solr.handler.component:- -QueryElevationComponent- -SpellCheckComponent- 
-org.apache.solr.highlight:- -DefaultSolrHighlighter- -org.apache.solr.spelling:- -SpellingQueryConverter- -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1813) Support Arabic PDF extraction
Support Arabic PDF extraction - Key: SOLR-1813 URL: https://issues.apache.org/jira/browse/SOLR-1813 Project: Solr Issue Type: Improvement Components: contrib - Solr Cell (Tika extraction) Affects Versions: 1.4 Reporter: Robert Muir Extraction of Arabic text from PDF files is supported by tika/pdfbox, but we don't have the optional dependency to do it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1813) Support Arabic PDF extraction
[ https://issues.apache.org/jira/browse/SOLR-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1813: -- Attachment: SOLR-1813.patch Attached is a patch with a testcase. I can shrink the icu4j jar file if this is needed. I will attach the test pdf separately. Support Arabic PDF extraction - Key: SOLR-1813 URL: https://issues.apache.org/jira/browse/SOLR-1813 Project: Solr Issue Type: Improvement Components: contrib - Solr Cell (Tika extraction) Affects Versions: 1.4 Reporter: Robert Muir Attachments: arabic.pdf, SOLR-1813.patch Extraction of Arabic text from PDF files is supported by tika/pdfbox, but we don't have the optional dependency to do it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1813) Support Arabic PDF extraction
[ https://issues.apache.org/jira/browse/SOLR-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1813: -- Attachment: arabic.pdf the pdf file for contrib/extraction/src/test/resources/arabic.pdf Support Arabic PDF extraction - Key: SOLR-1813 URL: https://issues.apache.org/jira/browse/SOLR-1813 Project: Solr Issue Type: Improvement Components: contrib - Solr Cell (Tika extraction) Affects Versions: 1.4 Reporter: Robert Muir Attachments: arabic.pdf, SOLR-1813.patch Extraction of Arabic text from PDF files is supported by tika/pdfbox, but we don't have the optional dependency to do it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1813) Support Arabic PDF extraction
[ https://issues.apache.org/jira/browse/SOLR-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1813: -- Attachment: icu4j-4_2_1.jar the icu4j jar file that goes in contrib/extraction/lib Support Arabic PDF extraction - Key: SOLR-1813 URL: https://issues.apache.org/jira/browse/SOLR-1813 Project: Solr Issue Type: Improvement Components: contrib - Solr Cell (Tika extraction) Affects Versions: 1.4 Reporter: Robert Muir Attachments: arabic.pdf, icu4j-4_2_1.jar, SOLR-1813.patch Extraction of Arabic text from PDF files is supported by tika/pdfbox, but we don't have the optional dependency to do it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Solr Performance and Scalability
Tom, this is really completely unrelated, but given that you have such huge documents and I see you have exceeded term count limits in lucene, I can't help but wonder if you have ever considered Andrzej's index pruning patch? (It is simply a tool you can run on your index.) Depending upon requirements, it seems like it might be a good fit. http://issues.apache.org/jira/browse/LUCENE-1812 On Thu, Feb 11, 2010 at 3:11 PM, Tom Burton-West tburtonw...@gmail.com wrote: The HathiTrust Large Search indexes the OCR from 5 million volumes, with an average of 200-300 pages per volume. So the total number of pages indexed would be over 1 billion. However, we are not using pages as Solr documents, we are using the entire book, so we only have 5 million rather than 1 billion Solr documents. We also are not storing the OCRed text. Since the total size of the index for 5 million volumes is over 2 terabytes, we split the index into 10 shards, each indexing about 1/2 million documents. Given all that, our indexes are about 250-300GB for each 500,000 books. About 85% of that is the *prx position index. Unless you have enough memory on the OS to get a significant amount of the index into the disk OS cache, disk I/O is the big bottleneck, especially for phrase queries with common words. See http://www.hathitrust.org/blogs/large-scale-search for more details. Have you considered storing the OCR separately rather than in the Solr index, or does your use case require storing the OCR in the index? Tom Burton-West Digital Library Production Service University of Michigan Wick2804 wrote: We are thinking of creating a Lucene Solr project to store 50 million full text OCRed A4 pages. Is there anyone out there who could provide some kind of guidance on the size of index we are likely to generate, and are there any gotchas in the standard analysis engines for load and query that will cause us issues. Do large indexes cause memory issues on servers?
Any help or advice greatly appreciated. -- View this message in context: http://old.nabble.com/Solr-Performance-and-Scalability-tp27552013p27553353.html Sent from the Solr - Dev mailing list archive at Nabble.com. -- Robert Muir rcm...@gmail.com
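As a rough sanity check on the figures Tom quotes (5 million volumes over 10 shards, roughly 250-300GB of index per 500,000 books, about 85% of it position data), the arithmetic is straightforward. The 275GB midpoint below is an assumed illustrative value, not a number from the thread:

```java
public class IndexSizing {

    // Documents per shard when the corpus is split evenly.
    public static long docsPerShard(long totalDocs, int shards) {
        return totalDocs / shards;
    }

    // Size of the position (.prx) data, given shard index size and its fraction.
    public static double positionIndexGb(double shardIndexGb, double prxFraction) {
        return shardIndexGb * prxFraction;
    }

    public static void main(String[] args) {
        // 5 million volumes over 10 shards -> 500,000 docs each.
        System.out.println(docsPerShard(5_000_000L, 10));
        // Assumed 275GB midpoint of the quoted 250-300GB range, 85% positions:
        // well over 200GB of position data per shard, which is why phrase
        // queries are so disk-I/O bound unless the OS cache holds most of it.
        System.out.println(positionIndexGb(275.0, 0.85));
    }
}
```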
Re: Timeline of upgrade to Lucene 3.0.
Hi Dawid, here is the jira issue where you can track the status of getting off deprecated 2.9 APIs: https://issues.apache.org/jira/browse/SOLR-1659 On Fri, Feb 5, 2010 at 5:40 AM, Dawid Weiss dawid.we...@gmail.com wrote: Hi there, Is there any upgrade path to Lucene 3.0 in the plans? I ask because the head of Carrot2 is using Lucene 3.0 (and there are certain incompatible API signatures, as it turns out). An upgrade to the next planned Carrot2 3.2.0 release would bring the benefit of not having to download external JARs (we replaced all the LGPL code and persuaded simple-xml author to switch to the Apache license). Corresponding Carrot2 issue for this is here: http://issues.carrot2.org/browse/CARROT-623 Dawid -- Robert Muir rcm...@gmail.com
[jira] Created: (SOLR-1760) convert synonymsfilter to new tokenstream API
convert synonymsfilter to new tokenstream API - Key: SOLR-1760 URL: https://issues.apache.org/jira/browse/SOLR-1760 Project: Solr Issue Type: Task Components: Schema and Analysis Reporter: Robert Muir This is the other non-trivial tokenstream to convert to the new API. I looked at this again today, and think I have a design where it will be nice and efficient. If you have ideas or are already looking at it, please comment!! I haven't started coding and we shouldn't duplicate any efforts. Here is my current design: * add a variable 'maximumContext' to SynonymMap. This is simply the maximum singleMatch.size(), i.e. the maximum number of tokens of lookahead that is ever needed. * save/restoreState/cloning can be minimized by using a stack (fixed array of maximumContext) of references to the SynonymMap submaps. This way we can backtrack efficiently for multiword matches without save/restoreState and with fewer comparisons. * two queues (can be fixed arrays of maximumContext) are still needed for placing state objects: the first is those that have been evaluated (always empty in the case of !preserveOriginal), and the second is those that haven't yet been evaluated but are queued due to lookahead. I plan on coding this up soon; if you have a better idea or have started work, please comment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
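A much-simplified sketch of the lookahead idea in the design above. The real SynonymFilter deals with Attributes, captured states, and preserveOriginal; this hypothetical version only shows how the depth of the match trie bounds the lookahead to maximumContext tokens, so fixed-size buffers suffice:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SynonymSketch {

    // Trie node: one level per token of a multi-word synonym.
    static class Node {
        Map<String, Node> next = new HashMap<>();
        String replacement; // non-null if a synonym match ends at this node
    }

    final Node root = new Node();
    int maximumContext = 0; // longest multi-word match, in tokens

    public void add(String[] match, String replacement) {
        Node n = root;
        for (String w : match) n = n.next.computeIfAbsent(w, k -> new Node());
        n.replacement = replacement;
        maximumContext = Math.max(maximumContext, match.length);
    }

    // Greedy longest-match replacement over a token list; never looks
    // ahead more than maximumContext tokens from the current position.
    public List<String> process(List<String> input) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < input.size()) {
            Node n = root;
            int matchedLen = 0;
            String matched = null;
            for (int j = 0; j < maximumContext && i + j < input.size(); j++) {
                n = n.next.get(input.get(i + j));
                if (n == null) break;
                if (n.replacement != null) { matched = n.replacement; matchedLen = j + 1; }
            }
            if (matched != null) { out.add(matched); i += matchedLen; }
            else { out.add(input.get(i)); i++; }
        }
        return out;
    }
}
```

With the map {"a b" -> "ab"}, maximumContext is 2, so processing "x a b y" only ever buffers two tokens before deciding whether to emit "ab" or fall back to the single token.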
[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1657: -- Attachment: SOLR-1657.patch Chris's patch, except it also implements BufferedTokenStream. It's marked deprecated; its API cannot support custom attributes (so the six standard attributes are simply copied into Tokens and back), and it's unused in Solr with this patch. convert the rest of solr to use the new tokenstream API --- Key: SOLR-1657 URL: https://issues.apache.org/jira/browse/SOLR-1657 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch org.apache.solr.analysis: BufferedTokenStream - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- org.apache.solr.handler: AnalysisRequestHandler AnalysisRequestHandlerBase org.apache.solr.handler.component: QueryElevationComponent SpellCheckComponent org.apache.solr.highlight: DefaultSolrHighlighter org.apache.solr.search: FieldQParserPlugin org.apache.solr.spelling: SpellingQueryConverter -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1657: -- Description: org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- org.apache.solr.handler: AnalysisRequestHandler AnalysisRequestHandlerBase org.apache.solr.handler.component: QueryElevationComponent SpellCheckComponent org.apache.solr.highlight: DefaultSolrHighlighter org.apache.solr.spelling: SpellingQueryConverter was: org.apache.solr.analysis: BufferedTokenStream - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- org.apache.solr.handler: AnalysisRequestHandler AnalysisRequestHandlerBase org.apache.solr.handler.component: QueryElevationComponent SpellCheckComponent org.apache.solr.highlight: DefaultSolrHighlighter org.apache.solr.search: FieldQParserPlugin org.apache.solr.spelling: SpellingQueryConverter convert the rest of solr to use the new tokenstream API --- Key: SOLR-1657 URL: https://issues.apache.org/jira/browse/SOLR-1657 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- org.apache.solr.handler: AnalysisRequestHandler AnalysisRequestHandlerBase org.apache.solr.handler.component: QueryElevationComponent SpellCheckComponent org.apache.solr.highlight: 
DefaultSolrHighlighter org.apache.solr.spelling: SpellingQueryConverter -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug
[ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829092#action_12829092 ] Robert Muir commented on SOLR-1670: --- bq. Order of overlapping tokens is unimportant in every TokenFilter used in Solr that I know about. Order-sensitivity is the exception, no? I guess all along my problem is that tokenstreams are ordered by definition. If this order does not matter, a test that uses actual queries would make more sense. The problem was that the test constructions previously used by this filter were used in other places where they really shouldn't have been, and the laxness hid real bugs (such as this very issue itself!!!). This is all I am trying to avoid. There is nothing wrong with Steven's patch/test construction; I am just trying to err on the side of caution. synonymfilter/map repeat bug Key: SOLR-1670 URL: https://issues.apache.org/jira/browse/SOLR-1670 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 1.4 Reporter: Robert Muir Assignee: Yonik Seeley Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter: the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of tokens are equal, it just stops at the shortest one. {code} // repeats map.add(strings(a b), tokens(ab), orig, merge); map.add(strings(a b), tokens(ab), orig, merge); assertTokEqual(getTokList(map,a b,false), tokens(ab)); /* in reality the result from getTokList is ab ab ab! */ {code} when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
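The assertTokEqual flaw this issue describes can be illustrated with a hypothetical reconstruction (these methods are stand-ins, not the actual Solr test helpers): a lenient comparison that stops at the shorter list reports the buggy "ab ab ab" output as equal to the expected single "ab", while a strict comparison surfaces the repeat bug.

```java
import java.util.Arrays;
import java.util.List;

public class TokAssert {

    // Lenient check (the flawed style): compares only up to the shorter list,
    // so extra trailing tokens are never noticed.
    public static boolean tokEqualLenient(List<String> actual, List<String> expected) {
        int n = Math.min(actual.size(), expected.size());
        for (int i = 0; i < n; i++) {
            if (!actual.get(i).equals(expected.get(i))) return false;
        }
        return true;
    }

    // Strict check (assertTokenStreamContents-style): lengths must match too.
    public static boolean tokEqualStrict(List<String> actual, List<String> expected) {
        return actual.equals(expected);
    }

    public static void main(String[] args) {
        List<String> actual = Arrays.asList("ab", "ab", "ab"); // the repeat bug's real output
        List<String> expected = Arrays.asList("ab");
        System.out.println(tokEqualLenient(actual, expected)); // true  -- bug hidden
        System.out.println(tokEqualStrict(actual, expected));  // false -- bug surfaced
    }
}
```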
[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug
[ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829097#action_12829097 ] Robert Muir commented on SOLR-1670: --- bq. Not at the semantic level (for overlapping tokens). Another way to look at it is that a tokenstream is just a sequence of tokens, and posInc is just another attribute. Your description of the semantics makes sense in terms of how it is used by the indexer, but the order of these tokens can matter: if someone uses a custom TokenFilter it might matter for some custom attributes, and it might matter for a different consumer; it's different behavior. I have made an effort to preserve all the behavior of all these tokenstreams when converting to the new API. I really don't want to break anything. synonymfilter/map repeat bug Key: SOLR-1670 URL: https://issues.apache.org/jira/browse/SOLR-1670 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 1.4 Reporter: Robert Muir Assignee: Yonik Seeley Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter: the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of tokens are equal, it just stops at the shortest one. {code} // repeats map.add(strings(a b), tokens(ab), orig, merge); map.add(strings(a b), tokens(ab), orig, merge); assertTokEqual(getTokList(map,a b,false), tokens(ab)); /* in reality the result from getTokList is ab ab ab! */ {code} when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug
[ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828204#action_12828204 ] Robert Muir commented on SOLR-1670: --- bq. I left in place the existing test method, which requires the specified order. Is it possible to only expose the 'unsorted' one to the synonyms test (such as in the synonyms test file itself, rather than the base token stream test case)? I can't think of another situation where it would make sense; it is more likely to be abused instead. synonymfilter/map repeat bug Key: SOLR-1670 URL: https://issues.apache.org/jira/browse/SOLR-1670 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 1.4 Reporter: Robert Muir Assignee: Yonik Seeley Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter: the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of tokens are equal, it just stops at the shortest one. {code} // repeats map.add(strings(a b), tokens(ab), orig, merge); map.add(strings(a b), tokens(ab), orig, merge); assertTokEqual(getTokList(map,a b,false), tokens(ab)); /* in reality the result from getTokList is ab ab ab! */ {code} when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug
[ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806833#action_12806833 ] Robert Muir commented on SOLR-1670: --- Steven, I don't have a problem with your patch (I do not wish to be in the way of anyone trying to work on SynonymFilter), but I want to explain some of where I was coming from. The main reason I got myself into this mess was to try to add WordNet support to Solr. However, this is currently not possible without duplicating a lot of code. We need to be really careful about allowing any order; it does matter in some situations. For example, Lucene's synonymfilter (with WordNet support) has an option to limit the number of expansions (so it's like a top-N synonym expansion). Solr doesn't currently have this, so it's N/A for now, but it's just an example where the order suddenly becomes important. Only slightly related: we added some improvements to this assertion in Lucene recently and found a lot of bugs (better checking for clearAttribute() and end()). At some point I would like to port these test improvements over to Solr, too. synonymfilter/map repeat bug Key: SOLR-1670 URL: https://issues.apache.org/jira/browse/SOLR-1670 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 1.4 Reporter: Robert Muir Assignee: Yonik Seeley Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter: the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of tokens are equal, it just stops at the shortest one. {code} // repeats map.add(strings(a b), tokens(ab), orig, merge); map.add(strings(a b), tokens(ab), orig, merge); assertTokEqual(getTokList(map,a b,false), tokens(ab)); /* in reality the result from getTokList is ab ab ab! */ {code} when converted to assertTokenStreamContents this problem surfaced.
Attached is an additional assertion to the existing testcase.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
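The flaw described above can be sketched without any Lucene dependency. The two helper methods below are hypothetical stand-ins (assertTokEqual, getTokList, and the Solr test harness are not reproduced here); they only illustrate why a comparison that stops at the shorter list lets the "ab ab ab" result pass against an expected "ab":

```java
import java.util.Arrays;
import java.util.List;

public class TokenAssertSketch {
    // Flawed check in the spirit of assertTokEqual: it compares only up to
    // the shorter list, so ["ab", "ab", "ab"] "equals" ["ab"].
    static boolean lenientTokEqual(List<String> actual, List<String> expected) {
        int n = Math.min(actual.size(), expected.size());
        for (int i = 0; i < n; i++) {
            if (!actual.get(i).equals(expected.get(i))) return false;
        }
        return true; // trailing tokens are never examined
    }

    // Strict check in the spirit of assertTokenStreamContents:
    // lengths must match, then every token must match.
    static boolean strictTokEqual(List<String> actual, List<String> expected) {
        return actual.equals(expected);
    }

    public static void main(String[] args) {
        // The repeated map.add() calls produce "ab ab ab", not "ab".
        List<String> actual = Arrays.asList("ab", "ab", "ab");
        List<String> expected = Arrays.asList("ab");

        System.out.println(lenientTokEqual(actual, expected)); // true  (bug hidden)
        System.out.println(strictTokEqual(actual, expected));  // false (bug surfaces)
    }
}
```

This is why converting the test surfaced the problem: the strict comparison checks the full token list, including its length.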
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805187#action_12805187 ] Robert Muir commented on SOLR-1677: ---

bq. 2) Perhaps you should read the StopFilter example i already posted in my last comment... https://issues.apache.org/jira/browse/LUCENE-2094?focusedCommentId=12783932page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12783932

As far as this one goes, I specifically commented before on this not being 'hidden' by Version (with Solr users in mind) but instead being its own option that every user should consider, regardless of defaults. For the StopFilter posInc, the user should think it through; it's pretty strange, as I mention in my comment, that a definite article like 'the' gets a posInc bump in one language but not another, simply because it happens to be separated by a space.

I guess I couldn't care less what the default is; if you care about such things you shouldn't be using the defaults, and should instead specify this yourself in the schema, where Version has no effect. I can't really defend the whole StopFilter posInc thing, as again I think it doesn't make a whole lot of sense. Maybe it works well for English; I guess I won't argue about it.

Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
---
Key: SOLR-1677
URL: https://issues.apache.org/jira/browse/SOLR-1677
Project: Solr
Issue Type: Sub-task
Components: Schema and Analysis
Reporter: Uwe Schindler
Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch

Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility with old indexes created using older versions of Lucene. The most important example is StandardTokenizer, which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in 2.9.
In Lucene 3.0 this matchVersion ctor parameter is mandatory, and in 3.1, with much more Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9, the deprecated old ctors without Version take LUCENE_24 as the default to mimic the old behaviour, e.g. in StandardTokenizer.

This patch adds basic support for the Lucene Version property to the base factories. Subclasses can then use the decoded luceneMatchVersion enum (in 3.0) / parameter (in 2.9) for constructing TokenStreams. The code currently contains a helper map to decode the version strings, but in 3.0 it can be replaced by Version.valueOf(String), as Version is a subclass of Java5 enums. The default value is Version.LUCENE_24 (as this is the default for the no-version ctors in Lucene).

This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
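The posInc behavior debated in the comment above can be made concrete with a toy stop filter. This is a sketch, not Lucene's actual StopFilter code: the Tok record and the stop() helper are hypothetical, and only model the rule that a dropped stopword adds to the next token's position increment when the option is enabled:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PosIncSketch {
    record Tok(String term, int posInc) {}

    // Toy stop filter: drops stopwords; when enablePositionIncrements is
    // true, each dropped word adds 1 to the next emitted token's position
    // increment, leaving a "hole" where the stopword was.
    static List<Tok> stop(List<String> terms, Set<String> stopwords,
                          boolean enablePositionIncrements) {
        List<Tok> out = new ArrayList<>();
        int pending = 1;
        for (String t : terms) {
            if (stopwords.contains(t)) {
                if (enablePositionIncrements) pending++;
                continue;
            }
            out.add(new Tok(t, pending));
            pending = 1;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the"));
        // In English, 'the' is a separate whitespace-delimited word, so the
        // token after it gets posInc=2; in a language where the article is
        // attached to the noun, no 'the' token exists and no bump happens.
        System.out.println(stop(Arrays.asList("fox", "jumped", "over", "the", "dog"), stops, true));
    }
}
```

Here "dog" is emitted with posInc=2 while every other token has posInc=1, which is exactly the space-dependent asymmetry the comment calls strange.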
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802979#action_12802979 ] Robert Muir commented on SOLR-1677: ---

bq. The point I was trying to make is that the types of bug fixes we make in Lucene are not mathematical absolutes - we're not fixing bugs where 1+1=3.

You are wrong; they are absolutes. And here are the JIRA issues for stemming bugs, since you didn't take my hint to go and actually read them.

LUCENE-2055: I used the snowball tests against these stemmers, which claim to implement the 'snowball algorithm', and they fail. This is an absolute, and the fix is to instead use snowball.

LUCENE-2203: I used the snowball tests against these stemmers and they failed. Here is Martin Porter's confirmation that these are bugs: http://article.gmane.org/gmane.comp.search.snowball/1139

Perhaps you should come up with a better example than stemming, as you don't know what you are talking about.

Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
---
Key: SOLR-1677
URL: https://issues.apache.org/jira/browse/SOLR-1677
Project: Solr
Issue Type: Sub-task
Components: Schema and Analysis
Reporter: Uwe Schindler
Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch

Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility with old indexes created using older versions of Lucene. The most important example is StandardTokenizer, which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in 2.9. In Lucene 3.0 this matchVersion ctor parameter is mandatory, and in 3.1, with much more Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9, the deprecated old ctors without Version take LUCENE_24 as the default to mimic the old behaviour, e.g. in StandardTokenizer. This patch adds basic support for the Lucene Version property to the base factories.
Subclasses can then use the decoded luceneMatchVersion enum (in 3.0) / parameter (in 2.9) for constructing TokenStreams. The code currently contains a helper map to decode the version strings, but in 3.0 it can be replaced by Version.valueOf(String), as Version is a subclass of Java5 enums. The default value is Version.LUCENE_24 (as this is the default for the no-version ctors in Lucene). This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
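The "helper map to decode the version strings" described in the issue, and the Version.valueOf(String) replacement it anticipates, can be sketched as follows. The Version enum here is a toy stand-in for o.a.lucene.util.Version, and parseMatchVersion is a hypothetical factory helper; only the default-to-LUCENE_24 behavior and the map-vs-valueOf choice are taken from the description:

```java
import java.util.HashMap;
import java.util.Map;

public class VersionDecodeSketch {
    // Toy stand-in for o.a.lucene.util.Version (the real constant set differs).
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // The helper map from the description: version string -> Version constant.
    // Once Version is a real Java5 enum (Lucene 3.0), this map can be
    // replaced wholesale by Version.valueOf(name).
    static final Map<String, Version> DECODE = new HashMap<>();
    static {
        for (Version v : Version.values()) {
            DECODE.put(v.name(), v);
        }
    }

    // Hypothetical factory helper: decode a luceneMatchVersion init arg,
    // defaulting to LUCENE_24 like the no-version ctors in Lucene 2.9.
    static Version parseMatchVersion(String arg) {
        if (arg == null) return Version.LUCENE_24;
        Version v = DECODE.get(arg);
        if (v == null) {
            throw new IllegalArgumentException("Unknown luceneMatchVersion: " + arg);
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(parseMatchVersion(null));        // LUCENE_24 (default)
        System.out.println(parseMatchVersion("LUCENE_30")); // LUCENE_30
        System.out.println(Version.valueOf("LUCENE_29"));   // LUCENE_29 (enum path)
    }
}
```

The enum-based valueOf path and the map produce the same results for known names; the map is only needed while Version is not yet an enum.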