Synchronizing Lucene indexes across 2 application servers

2009-06-18 Thread mitu2009

I have a web application which uses Lucene for its search functionality. Lucene
search requests are served by web services sitting on two application servers
(IIS 7). The two application servers are load balanced using NetScaler.

Both servers run a nightly batch job which updates the search indexes on the
respective server.

I need to synchronize the search indexes on these two servers so that at any
point in time both servers have up-to-date indexes. What would be the best
architecture/design strategy to do this, given that either of the two
application servers could be serving a search request depending on its
availability?

Any inputs please?

Thanks for reading!
-- 
View this message in context: 
http://www.nabble.com/Synchronizing-Lucene-indexes-across-2-application-servers-tp24086961p24086961.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.





[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-06-18 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1693:
--

Attachment: (was: LUCENE-1693.patch)

 AttributeSource/TokenStream API improvements
 

 Key: LUCENE-1693
 URL: https://issues.apache.org/jira/browse/LUCENE-1693
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 lucene-1693.patch, TestCompatibility.java, TestCompatibility.java


 This patch makes the following improvements to AttributeSource and
 TokenStream/Filter:
 - removes the set/getUseNewAPI() methods (including the standard
   ones). Instead by default incrementToken() throws a subclass of
   UnsupportedOperationException. The indexer tries to call
   incrementToken() initially once to see if the exception is thrown;
   if so, it falls back to the old API.
 - introduces interfaces for all Attributes. The corresponding
   implementations have the postfix 'Impl', e.g. TermAttribute and
   TermAttributeImpl. AttributeSource now has a factory for creating
   the Attribute instances; the default implementation looks for
   implementing classes with the postfix 'Impl'. Token now implements
   all 6 TokenAttribute interfaces.
 - new method added to AttributeSource:
   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
   class hierarchy of the passed in object and finds all interfaces
   that the class or superclasses implement and that extend the
   Attribute interface. It then adds the interface-instance mappings
   to the attribute map for each of the found interfaces.
 - AttributeImpl now has a default implementation of toString that uses
   reflection to print out the values of the attributes in a default
   formatting. This makes it a bit easier to implement AttributeImpl,
   because toString() was declared abstract before.
 - Cloning is now done much more efficiently in
   captureState. The method figures out which unique AttributeImpl
   instances are contained as values in the attributes map, because
   those are the ones that need to be cloned. It creates a single
   linked list that supports deep cloning (in the inner class
   AttributeSource.State). AttributeSource keeps track of when this
   state changes, i.e. whenever new attributes are added to the
   AttributeSource. Only in that case will captureState recompute the
   state, otherwise it will simply clone the precomputed state and
   return the clone. restoreState(AttributeSource.State) walks the
   linked list and uses the copyTo() method of AttributeImpl to copy
   all values over into the attribute that the source stream
   (e.g. SinkTokenizer) uses. 
 The cloning performance can be greatly improved if not multiple
 AttributeImpl instances are used in one TokenStream. A user can
 e.g. simply add a Token instance to the stream instead of the individual
 attributes. Or the user could implement a subclass of AttributeImpl that
 implements exactly the Attribute interfaces needed. I think this
 should be considered an expert API (addAttributeImpl), as this manual
 optimization is only needed if cloning performance is crucial. I ran
 some quick performance tests using Tee/Sink tokenizers (which do
 cloning) and the performance was roughly 20% faster with the new
 API. I'll run some more performance tests and post more numbers then.
 Note also that when we add serialization to the Attributes, e.g. for
 supporting storing serialized TokenStreams in the index, then the
 serialization should benefit even significantly more from the new API
 than cloning. 
 Also, the TokenStream API does not change, except for the removal 
 of the set/getUseNewAPI methods. So the patches in LUCENE-1460
 should still work.
 All core tests pass, however, I need to update all the documentation
 and also add some unit tests for the new AttributeSource
 functionality. So this patch is not ready to commit yet, but I wanted
 to post it already for some feedback. 
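
To make this more concrete, here is a rough sketch of what a simple filter could look
like against the attribute-based API described above. It is only a sketch: the names
used (TermAttribute, addAttribute, incrementToken) follow this description and may not
match the final patch exactly.

{code}
// Sketch only: a lowercasing TokenFilter written against the attribute-based
// API described above (addAttribute + incrementToken); not part of the patch.
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class SketchLowerCaseFilter extends TokenFilter {

  private final TermAttribute termAtt;

  public SketchLowerCaseFilter(TokenStream input) {
    super(input);
    // Registers the interface; the default factory instantiates the 'Impl' class.
    termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;                        // end of stream
    }
    // Work directly on the shared attribute instance; no Token object is created.
    final char[] buffer = termAtt.termBuffer();
    final int length = termAtt.termLength();
    for (int i = 0; i < length; i++) {
      buffer[i] = Character.toLowerCase(buffer[i]);
    }
    return true;
  }
}
{code}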




[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-06-18 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1693:
--

Attachment: LUCENE-1693.patch

Sorry, the last patch was invalid (did not compile); I forgot to revert some 
changes before posting.
The attached patch still has problems in TeeTokenStream, SinkTokenizer and 
CachingTokenFilter (see before), but fixes:
- double cloning of payloads
- the first of your tests now works correctly, even if I remove next() from 
StopFilter and/or LowercaseFilter

 AttributeSource/TokenStream API improvements
 

 Key: LUCENE-1693
 URL: https://issues.apache.org/jira/browse/LUCENE-1693
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, 
 TestCompatibility.java




[jira] Updated: (LUCENE-1692) Contrib analyzers need tests

2009-06-18 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1692:


Attachment: LUCENE-1692.txt

Adds tests for ThaiAnalyzer token offsets and types, both of which have bugs!
Tests for the correct behavior are included but commented out.


 Contrib analyzers need tests
 

 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt


 The analyzers in contrib need tests, preferably ones that test the behavior 
 of all the Token 'attributes' involved (offsets, type, etc.) and not just what 
 they do with the token text.
 This way, they can be converted to the new API without breakage.
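
For example, such a test could assert the full attribute set per token, not just the
term text. The sketch below is not the attached patch; the analyzer and the expected
values are only illustrative, and it deliberately uses the old Token-based API so the
same assertions can be re-run after the conversion.

{code}
// Sketch of the kind of per-token assertions meant above; analyzer and
// expected values are illustrative only, not the attached patch.
import java.io.StringReader;

import junit.framework.TestCase;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;

public class SketchAnalyzerAttributeTest extends TestCase {

  // Checks term text, start/end offsets and type for each expected token.
  private void assertAnalyzesTo(Analyzer a, String input, String[] terms,
      int[] starts, int[] ends, String[] types) throws Exception {
    TokenStream ts = a.tokenStream("field", new StringReader(input));
    final Token reusable = new Token();
    for (int i = 0; i < terms.length; i++) {
      Token t = ts.next(reusable);
      assertNotNull(t);
      assertEquals(terms[i], t.term());
      assertEquals(starts[i], t.startOffset());
      assertEquals(ends[i], t.endOffset());
      assertEquals(types[i], t.type());
    }
    assertNull("end of stream expected", ts.next(reusable));
  }

  public void testOffsetsAndTypes() throws Exception {
    assertAnalyzesTo(new WhitespaceAnalyzer(), "foo bar",
        new String[] { "foo", "bar" },
        new int[] { 0, 4 },
        new int[] { 3, 7 },
        new String[] { "word", "word" });
  }
}
{code}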




[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-18 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721116#action_12721116
 ] 

Simon Willnauer commented on LUCENE-1696:
-

I will be around and will fix/adjust it if it needs some changes. If I do not 
react, please ping me on this issue. Thanks!

 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch, 
 TestGermanCollation.java


 I added an implementation of incrementToken to ASCIIFoldingFilter.java and 
 extended the existing testcase for it.
 I will attach the patch shortly.
 Besides this improvement, I would like to start a small discussion about 
 this filter. ASCIIFoldingFilter is meant to be a replacement for 
 ISOLatin1AccentFilter, which is quite nice as it covers a superset of the 
 latter. I have used this filter quite often, but never on an as-is basis. In 
 most cases this filter does the correct thing (replaces a special char 
 with its ASCII correspondent), but in some cases, like German umlauts, it 
 does not return the expected result. A German umlaut like 'ä' does not 
 translate to 'a' but rather to 'ae'. I would like to change this, but I'm not 
 100% sure that this is expected by all users of that filter. Another way of 
 doing it would be to make it configurable with a flag. This would not affect 
 performance, as we only check whether such an umlaut char is found.
 Further, it would be really helpful if this filter could inject the 
 original/unmodified token at the same position (position increment 0) into 
 the token stream on demand. I think it is a valid use-case to index both the 
 modified and the unmodified token. For instance, the German word 'süd' would 
 be folded to 'sud'. In a query q:(süd) the filter would also fold to 'sud' and 
 therefore find 'sud', which has a totally different meaning. Folding works 
 quite well, but for these special cases we could add such options to make 
 users' lives easier. The latter could be done in a subclass, while the umlaut 
 problem should be fixed in the base class.
 simon 
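
To make the two suggestions above a bit more concrete, here is a rough sketch of a
filter that expands German umlauts and re-injects the unmodified token at the same
position. The fold() helper and the class name are made up for illustration; this is
not the attached patch, and the real folding logic lives in ASCIIFoldingFilter.

{code}
// Sketch only: illustrates an 'ä' -> "ae" style mapping plus injecting the
// original/unmodified token at the same position (position increment 0).
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class KeepOriginalGermanFoldingFilter extends TokenFilter {

  private final TermAttribute termAtt;
  private final PositionIncrementAttribute posIncAtt;
  private String pendingOriginal; // unmodified form still to be emitted

  public KeepOriginalGermanFoldingFilter(TokenStream input) {
    super(input);
    termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    posIncAtt = (PositionIncrementAttribute) addAttribute(PositionIncrementAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    if (pendingOriginal != null) {
      termAtt.setTermBuffer(pendingOriginal);  // same text as before folding
      posIncAtt.setPositionIncrement(0);       // same position as the folded token
      pendingOriginal = null;                  // offsets are left as they were
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    final String original = termAtt.term();
    final String folded = fold(original);
    if (!folded.equals(original)) {
      pendingOriginal = original;              // emit the original on the next call
      termAtt.setTermBuffer(folded);
    }
    return true;
  }

  // Stand-in for the real folding: expands German umlauts instead of stripping them.
  private static String fold(String s) {
    return s.replace("ä", "ae").replace("ö", "oe").replace("ü", "ue").replace("ß", "ss");
  }
}
{code}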




[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-06-18 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721135#action_12721135
 ] 

Michael Busch commented on LUCENE-1693:
---

{quote}
For backwards-compatibility we should deprecate the current versions of these 
classes [and only let them implement next(Token)]. 
{quote}

I agree. With my patch the Tee/Sink stuff doesn't work in all situations 
either, when the new API is used. We need to deprecate tee/sink and write a new 
class that implements the same functionality with the new API.

 AttributeSource/TokenStream API improvements
 

 Key: LUCENE-1693
 URL: https://issues.apache.org/jira/browse/LUCENE-1693
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, 
 TestCompatibility.java




Re: Synchronizing Lucene indexes across 2 application servers

2009-06-18 Thread Michael McCandless
Could you re-ask this on java-user instead?  Thanks!

(java-dev is for discussing how to make changes to Lucene; java-user
is for discussing usage of Lucene).

Mike

On Thu, Jun 18, 2009 at 2:13 AM, mitu2009 <musicfrea...@gmail.com> wrote:

 I have a web application which uses Lucene for its search functionality. Lucene
 search requests are served by web services sitting on two application servers
 (IIS 7). The two application servers are load balanced using NetScaler.

 Both servers run a nightly batch job which updates the search indexes on the
 respective server.

 I need to synchronize the search indexes on these two servers so that at any
 point in time both servers have up-to-date indexes. What would be the best
 architecture/design strategy to do this, given that either of the two
 application servers could be serving a search request depending on its
 availability?

 Any inputs please?

 Thanks for reading!
 --
 View this message in context: 
 http://www.nabble.com/Synchronizing-Lucene-indexes-across-2-application-servers-tp24086961p24086961.html
 Sent from the Lucene - Java Developer mailing list archive at Nabble.com.





Re: Lucene 2.9 Again

2009-06-18 Thread Michael McCandless
On Wed, Jun 17, 2009 at 4:13 PM, Mark Miller <markrmil...@gmail.com> wrote:
 Michael Busch wrote:

 Everyone who is unhappy with the release TODO's, go back in your mail
 archive to the 2.2 release and check how many tedious little changes we made
 to improve the release quality. And besides the maven stuff, there is not
 really more to do compared to pre-2.2, it's just documented in a more
 verbose (=RM-friendly) way.

 I didn't mean to imply anything untoward :) I'm grateful for the work you
 guys have put into making it all more friendly. I know I have seen many of
 Mike M's wiki updates on this page too, and I've always been sure it's for
 the better.

Well, I made lots of silly mistakes during my releases :)  (if you're
not making mistakes, you're not trying hard enough)

So every time I made a mistake I went and updated it.

 Even still, when I look at the process, I remember why I clung to Windows
 for so long :) Now I'm happily on Ubuntu and can still usually avoid such
 fun work :)

The next step after Ubuntu is OS X, of course ;)

 I'll happily soldier on though. I just wish it was all in Java :)

I pretty much find any excuse to go and write stuff in Python ;)  So,
I wrote a Python script that goes and signs/verifies sigs on all the
Maven artifacts.

Mike




[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-06-18 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721142#action_12721142
 ] 

Uwe Schindler commented on LUCENE-1693:
---

OK, we can merge our patches then! At the moment I see no real show-stoppers 
with the current approach. Have you tested thoroughly and measured performance? 
All tests from core and contrib/analyzers pass; the problems with your last 
TestCompatibility.java are Tee/Sink problems.
The interesting part (if we stay with my not-so-elegant-anymore solution 
because of the reflection hacks) would be to remove the deprecated next(Token) 
methods from the core streams, which would be a great code cleanup!

 AttributeSource/TokenStream API improvements
 

 Key: LUCENE-1693
 URL: https://issues.apache.org/jira/browse/LUCENE-1693
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, 
 TestCompatibility.java






[jira] Updated: (LUCENE-1673) Move TrieRange to core

2009-06-18 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1673:
--

Attachment: LUCENE-1673.patch

Final patch version with updated javadocs. I will commit in a day or two :-)
When committing, I will also remove TrieRange from contrib/search (not included 
in the patch).

If you want to make javadoc updates, feel free to post an updated patch or do 
it after I have committed.

After that I will do some work on NumericField and NumericSortField, as well as 
moving the parsers to FieldCache and making the plain-text-number parsers 
public there, too.

 Move TrieRange to core
 --

 Key: LUCENE-1673
 URL: https://issues.apache.org/jira/browse/LUCENE-1673
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1673.patch, LUCENE-1673.patch, LUCENE-1673.patch, 
 LUCENE-1673.patch, LUCENE-1673.patch


 TrieRange has been iterated many times and seems stable now (LUCENE-1470, 
 LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to 
 its default FieldTypes (SOLR-940), and if possible I want to move it to core 
 before the release of 2.9.
 Before this can be done, there are some things to think about:
 # There are now classes called LongTrieRangeQuery and IntTrieRangeQuery; what 
 should they be called in core? I would suggest leaving them as they are. On 
 the other hand, if this remains our only numeric query implementation, we 
 could call them LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below, 
 there are problems). Same for the TokenStreams and Filters.
 # Maybe the pairs of classes for indexing and searching should be moved into 
 one class each: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The 
 problem here: ctors must be able to take int, long, double, or float as range 
 parameters. For the end user, mixing these 4 types in one class is hard to 
 handle. If somebody forgets to add an L to a long, it suddenly instantiates 
 an int version of the range query, hitting no results, and so on. Same with 
 the other types (see the overload sketch below). Maybe accept java.lang.Number 
 as the parameter type (because it is nullable for half-open bounds) plus one 
 enum for the type.
 # Should TrieUtils move into o.a.l.util, or into o.a.l.document, or somewhere 
 else?
 # Move the TokenStreams into o.a.l.analysis and ShiftAttribute into 
 o.a.l.analysis.tokenattributes? Somewhere else?
 # If we rename the classes, should Solr stay with the Trie names (because 
 there are different impls)?
 # Maybe add a subclass of AbstractField that automatically creates these 
 TokenStreams and omits norms/tf by default, for easier addition to Document 
 instances?
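
The overload pitfall from point 2 in a minimal, self-contained form; the newRangeQuery
signatures below are made up for illustration and are not the actual TrieRange/Numeric
API.

{code}
// Illustration of the overload pitfall described in point 2 above; these
// factory signatures are hypothetical, not the actual TrieRange/Numeric API.
public final class OverloadPitfallSketch {

  static String newRangeQuery(String field, int lower, int upper) {
    return "int range on " + field;
  }

  static String newRangeQuery(String field, long lower, long upper) {
    return "long range on " + field;
  }

  public static void main(String[] args) {
    // Without the L suffix Java silently picks the int overload, even if the
    // field was indexed as a long; the query then matches nothing.
    System.out.println(newRangeQuery("price", 1, 1000));    // int range on price
    System.out.println(newRangeQuery("price", 1L, 1000L));  // long range on price
  }
}
{code}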




[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-06-18 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721158#action_12721158
 ] 

Uwe Schindler commented on LUCENE-1693:
---

By the way, I also tested Solr's token streams after updating the Lucene jar 
file. All tests pass (only some unrelated ones fail because of the latest 
changes in Lucene trunk, and there are some compile failures because of changes 
in non-released APIs).
Solr's TokenStreams are all programmed against the old API, but they get 
inverted using incrementToken from our patch.
Also, the Solr query parser seems to work.

 AttributeSource/TokenStream API improvements
 

 Key: LUCENE-1693
 URL: https://issues.apache.org/jira/browse/LUCENE-1693
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, 
 TestCompatibility.java




[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-06-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721196#action_12721196
 ] 

Michael McCandless commented on LUCENE-1630:



  * I wonder if we should have a separate TopScorer class, that
doesn't expose nextDoc/advance methods?  And then a separate
QueryWeight.topScorer method instead of a boolean arg to
QueryWeight.scorer.  (I'm torn...).  EG, if you get a topScorer,
you are not supposed to call nextDoc/advance on it, so it really
feels like it wants to be a different class than Scorer...

  * Update the CHANGES entry based on the iterations on the patch
(e.g. supportsDocsOutOfOrder -> acceptsDocsOutOfOrder)

  * Can we rename QW.scoresOutOfOrder -> QW.scoresDocsOutOfOrder?

  * In IndexSearcher ~line 221, shouldn't we pass true for
scoresDocsInOrder in {{Scorer scorer = weight.scorer(reader, false, true)}}?

  * QyertWeight -> QueryWeight

  * I think CustomScoreQuery.scorer should actually always score docs
in order?  So CustomWeight.scoresOutOfOrder should return false?
And CustomWeight.scorer should pass true for scoreDocsInOrder to
all sub-weights?


 Mating Collector and Scorer on doc Id orderness
 ---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch, 
 LUCENE-1630.patch


 This is a spin off of LUCENE-1593. This issue proposes to expose appropriate 
 API on Scorer and Collector such that one can create an optimized Collector 
 based on a given Scorer's doc-id orderness and vice versa. Copied from 
 LUCENE-1593, here is the list of changes:
 # Deprecate Weight and create QueryWeight (abstract class) with a new 
 scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) 
 method. QueryWeight implements Weight, while score(reader) calls 
 score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) 
 is defined abstract.
 #* Also add QueryWeightWrapper to wrap a given Weight implementation. This 
 one will also be deprecated, as well as package-private.
 #* Add to Query variants of createWeight and weight which return QueryWeight. 
 For now, I prefer to add a default impl which wraps the Weight variant 
 instead of overriding in all Query extensions, and in 3.0 when we remove the 
 Weight variants - override in all extending classes.
 # Add to Scorer isOutOfOrder with a default to false, and override in BS to 
 true.
 # Modify BooleanWeight to extend QueryWeight and implement the new scorer 
 method to return BS2 or BS based on the number of required scorers and 
 setAllowOutOfOrder.
 # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns 
 true/false.
 #* Use it in IndexSearcher.search methods, that accept a Collector, in order 
 to create the appropriate Scorer, using the new QueryWeight.
 #* Provide a static create method to TFC and TSDC which accept this as an 
 argument and creates the proper instance.
 #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order 
 Scorer and check on the resulting Scorer isOutOfOrder(), so that we can 
 create the optimized Collector instance.
 # Modify IndexSearcher to use all of the above logic.
 The only class I'm worried about, and would like to verify with you, is 
 Searchable. If we want to deprecate all the search methods on IndexSearcher, 
 Searcher and Searchable which accept Weight and add new ones which accept 
 QueryWeight, we must do the following:
 * Deprecate Searchable in favor of Searcher.
 * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) 
 break back-compat and add them as abstract (like we've done with the new 
 Collector method) or (2) add them with a default impl to call the Weight 
 versions, documenting these will become abstract in 3.0.
 * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend 
 Searcher. That's the part I'm a little bit worried about - Searchable 
 implements java.rmi.Remote, which means there could be an implementation out 
 there which implements Searchable and extends something different than 
 UnicastRemoteObject, like Activeable. I think there is very small chance this 
 has actually happened, but would like to confirm with you guys first.
 * Add a deprecated, package-private, SearchableWrapper which extends Searcher 
 and delegates all calls to the Searchable member.
 * Deprecate all uses of Searchable and add Searcher instead, defaulting the 
 old ones to use SearchableWrapper.
 * Make all the necessary changes to IndexSearcher, MultiSearcher etc. 
 regarding 

[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-06-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721204#action_12721204
 ] 

Shai Erera commented on LUCENE-1630:


bq. QyertWeight -> QueryWeight

I'll fix it. Can you please next time give me a hint on where you found it? :)

bq. I wonder if we should have a separate TopScorer class

I remember that at some point I suggested to have a score(Searcher, Collector) 
on QueryWeight, and make Scorer.score(Collector) package-private (of course 
we'd need to deprecate first and invent a new name). But then I realized that 
custom weights would still need access to Scorer.score(Collector) if they want 
to use an existing Scorer or something.

Taking Scorer.score(Collector) out of Scorer and into TopScorer is a large 
refactoring. Are you sure about this? I just think of all the Scorers we have 
(and that are out there) that would need to implement a new class, and possibly 
duplicate a lot of code that is today shared between the top-level scorer and 
the iterator-type scorer.

I understand what you say - "so it really feels like it wants to be a different 
class than Scorer" - I feel that too. But I don't see a great ROI here.

 Mating Collector and Scorer on doc Id orderness
 ---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch, 
 LUCENE-1630.patch



[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-06-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721208#action_12721208
 ] 

Shai Erera commented on LUCENE-1630:


bq. I think CustomScoreQuery.scorer should actually always score docs in order? 

Why? I don't see that it relies on doc id orderness anywhere. What if its 
subWeight is a BooleanWeight and I use a Collector which accepts docs 
out-of-order? Will I have a problem if I ask for an out-of-order Scorer?

 Mating Collector and Scorer on doc Id orderness
 ---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch, 
 LUCENE-1630.patch


 This is a spin off of LUCENE-1593. This issue proposes to expose appropriate 
 API on Scorer and Collector such that one can create an optimized Collector 
 based on a given Scorer's doc-id orderness and vice versa. Copied from 
 LUCENE-1593, here is the list of changes:
 # Deprecate Weight and create QueryWeight (abstract class) with a new 
 scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) 
 method. QueryWeight implements Weight, while score(reader) calls 
 score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) 
 is defined abstract.
 #* Also add QueryWeightWrapper to wrap a given Weight implementation. This 
 one will also be deprecated, as well as package-private.
 #* Add to Query variants of createWeight and weight which return QueryWeight. 
 For now, I prefer to add a default impl which wraps the Weight variant 
 instead of overriding in all Query extensions, and in 3.0 when we remove the 
 Weight variants - override in all extending classes.
 # Add to Scorer isOutOfOrder with a default to false, and override in BS to 
 true.
 # Modify BooleanWeight to extend QueryWeight and implement the new scorer 
 method to return BS2 or BS based on the number of required scorers and 
 setAllowOutOfOrder.
 # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns 
 true/false (see the sketch at the end of this description).
 #* Use it in IndexSearcher.search methods, that accept a Collector, in order 
 to create the appropriate Scorer, using the new QueryWeight.
 #* Provide a static create method to TFC and TSDC which accept this as an 
 argument and creates the proper instance.
 #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order 
 Scorer and check on the resulting Scorer isOutOfOrder(), so that we can 
 create the optimized Collector instance.
 # Modify IndexSearcher to use all of the above logic.
 The only class I'm worried about, and would like to verify with you, is 
 Searchable. If we want to deprecate all the search methods on IndexSearcher, 
 Searcher and Searchable which accept Weight and add new ones which accept 
 QueryWeight, we must do the following:
 * Deprecate Searchable in favor of Searcher.
 * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) 
 break back-compat and add them as abstract (like we've done with the new 
 Collector method) or (2) add them with a default impl to call the Weight 
 versions, documenting these will become abstract in 3.0.
 * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend 
 Searcher. That's the part I'm a little bit worried about - Searchable 
 implements java.rmi.Remote, which means there could be an implementation out 
 there which implements Searchable and extends something different than 
 UnicastRemoteObject, like Activeable. I think there is very small chance this 
 has actually happened, but would like to confirm with you guys first.
 * Add a deprecated, package-private, SearchableWrapper which extends Searcher 
 and delegates all calls to the Searchable member.
 * Deprecate all uses of Searchable and add Searcher instead, defaulting the 
 old ones to use SearchableWrapper.
 * Make all the necessary changes to IndexSearcher, MultiSearcher etc. 
 regarding overriding these new methods.
 One other optimization that was discussed in LUCENE-1593 is to expose a 
 topScorer() API (on Weight) which returns a Scorer that its score(Collector) 
 will be called, and additionally add a start() method to DISI. That will 
 allow Scorers to initialize either on start() or score(Collector). This was 
 proposed mainly because of BS and BS2 which check if they are initialized in 
 every call to next(), skipTo() and score(). Personally I prefer to see that 
 in a separate issue, following that one (as it might add methods to 
 QueryWeight).
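
To illustrate the Collector side of this, here is a minimal sketch of a Collector that
opts in to out-of-order collection via the proposed acceptsDocsOutOfOrder() method. The
method name follows this description and is what this issue would add; the rest of the
Collector API already exists.

{code}
// Sketch only: a Collector that accepts out-of-order doc ids, so the searcher
// could hand it an out-of-order Scorer (e.g. BooleanScorer) when available.
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public class CountingCollector extends Collector {

  private int count;

  public void setScorer(Scorer scorer) {
    // scores are not needed for counting
  }

  public void collect(int doc) throws IOException {
    count++;                // the order of doc ids does not matter here
  }

  public void setNextReader(IndexReader reader, int docBase) {
    // nothing per-segment to do for a simple count
  }

  public boolean acceptsDocsOutOfOrder() {
    return true;            // proposed method: lets the searcher pick an out-of-order Scorer
  }

  public int getCount() {
    return count;
  }
}
{code}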




[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-06-18 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1693:
--

Attachment: (was: LUCENE-1693.patch)

 AttributeSource/TokenStream API improvements
 

 Key: LUCENE-1693
 URL: https://issues.apache.org/jira/browse/LUCENE-1693
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, 
 TestCompatibility.java






[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-06-18 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1693:
--

Attachment: LUCENE-1693.patch

Again an update: unified the reusable tokens in TokenWrapper.delegate. Now 
it is always set after each action, so no state changes are left out.

 AttributeSource/TokenStream API improvements
 

 Key: LUCENE-1693
 URL: https://issues.apache.org/jira/browse/LUCENE-1693
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, 
 TestCompatibility.java



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-06-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721276#action_12721276
 ] 

Michael McCandless commented on LUCENE-1630:


bq.  Can you please next time give me a hint on where you found it? 

OK :)  It's a quick search through the patch file though ;)

bq. Taking Scorer.score(Collector) out of Scorer and into TopScorer is a large 
re-factoring. Are you sure about this? I just think of all the Scorers we have, 
and out there, that would need to implement a new class and possibly duplicate a lot of 
code that is today shared between the top-level scorer and the iterator-type scorer.

I'm definitely not sure about it...

For Scorers that don't have anything special to do when they are top, we'd 
have a default impl (get a non-top Scorer and iterate over it, like 
Scorer.score does now). So I think the only Weight that'd do something 
interesting is BooleanQuery's.
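Roughly, such a default could just drive the wrapped (non-top) Scorer's iterator; a sketch for illustration only, not code from any patch here:

{code}
// illustration only: what a default "top" scoring loop could look like,
// driving an ordinary Scorer and feeding every hit to the Collector
void scoreAll(Scorer scorer, Collector collector) throws IOException {
  collector.setScorer(scorer);
  int doc;
  while ((doc = scorer.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
    collector.collect(doc);
  }
}
{code}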

But I agree this is a big change, so let's hold off for now?  With search 
specialization (LUCENE-1594) the difference between being top and being sub 
seems to be more important.

{quote}
bq. I think CustomScoreQuery.scorer should actually always score docs in order?

Why? I don't see that it relies on doc id orderness anywhere
{quote}

CustomScorer's nextDoc uses advance on its subScorers.

 Mating Collector and Scorer on doc Id orderness
 ---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch, 
 LUCENE-1630.patch


 This is a spin off of LUCENE-1593. This issue proposes to expose appropriate 
 API on Scorer and Collector such that one can create an optimized Collector 
 based on a given Scorer's doc-id orderness and vice versa. Copied from 
 LUCENE-1593, here is the list of changes:
 # Deprecate Weight and create QueryWeight (abstract class) with a new 
 scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) 
 method. QueryWeight implements Weight, while score(reader) calls 
 score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) 
 is defined abstract.
 #* Also add QueryWeightWrapper to wrap a given Weight implementation. This 
 one will also be deprecated, as well as package-private.
 #* Add to Query variants of createWeight and weight which return QueryWeight. 
 For now, I prefer to add a default impl which wraps the Weight variant 
 instead of overriding in all Query extensions, and in 3.0 when we remove the 
 Weight variants - override in all extending classes.
 # Add to Scorer isOutOfOrder with a default to false, and override in BS to 
 true.
 # Modify BooleanWeight to extend QueryWeight and implement the new scorer 
 method to return BS2 or BS based on the number of required scorers and 
 setAllowOutOfOrder.
 # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns 
 true/false.
 #* Use it in IndexSearcher.search methods, that accept a Collector, in order 
 to create the appropriate Scorer, using the new QueryWeight.
 #* Provide a static create method to TFC and TSDC which accept this as an 
 argument and creates the proper instance.
 #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order 
 Scorer and check on the resulting Scorer isOutOfOrder(), so that we can 
 create the optimized Collector instance.
 # Modify IndexSearcher to use all of the above logic.
 The only class I'm worried about, and would like to verify with you, is 
 Searchable. If we want to deprecate all the search methods on IndexSearcher, 
 Searcher and Searchable which accept Weight and add new ones which accept 
 QueryWeight, we must do the following:
 * Deprecate Searchable in favor of Searcher.
 * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) 
 break back-compat and add them as abstract (like we've done with the new 
 Collector method) or (2) add them with a default impl to call the Weight 
 versions, documenting these will become abstract in 3.0.
 * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend 
 Searcher. That's the part I'm a little bit worried about - Searchable 
 implements java.rmi.Remote, which means there could be an implementation out 
 there which implements Searchable and extends something different than 
 UnicastRemoteObject, like Activeable. I think there is very small chance this 
 has actually happened, but would like to confirm with you guys first.
 * Add a deprecated, package-private, SearchableWrapper which extends Searcher 
 and delegates all calls to the Searchable member.
 * Deprecate all uses of Searchable and add 

Re: Fuzzy search change

2009-06-18 Thread Michael McCandless
This would make an awesome addition to Lucene!

This is similar to how Lucene's spellchecker identifies candidates, if
I understand it right.

Would you be able to port it to Java?
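
For reference, a rough Java sketch of the trigram-filter-then-edit-distance idea
you describe below (all names and data structures are illustrative, not from
Lucene's spellchecker or from your C++ code):

{code}
// rough, illustrative sketch: build a trigram -> keyword-id index, filter
// candidates by shared trigrams, then rank the survivors with edit distance
import java.util.*;

class TrigramSpellIndex {
  private final Map<String, Set<Integer>> trigramToIds = new HashMap<String, Set<Integer>>();
  private final List<String> keywords = new ArrayList<String>();

  void add(String keyword) {
    int id = keywords.size();
    keywords.add(keyword);
    for (String tri : trigrams(keyword)) {
      Set<Integer> ids = trigramToIds.get(tri);
      if (ids == null) trigramToIds.put(tri, ids = new HashSet<Integer>());
      ids.add(id);
    }
  }

  List<String> suggest(String input, int maxEditDistance) {
    // filter: any keyword sharing at least one trigram with the input
    Set<Integer> candidates = new HashSet<Integer>();
    for (String tri : trigrams(input)) {
      Set<Integer> ids = trigramToIds.get(tri);
      if (ids != null) candidates.addAll(ids);
    }
    // rank: run edit distance only on the filtered candidate set
    List<String> result = new ArrayList<String>();
    for (int id : candidates) {
      if (editDistance(input, keywords.get(id)) <= maxEditDistance) {
        result.add(keywords.get(id));
      }
    }
    return result;
  }

  private static List<String> trigrams(String s) {
    List<String> out = new ArrayList<String>();
    for (int i = 0; i + 3 <= s.length(); i++) out.add(s.substring(i, i + 3));
    return out;
  }

  private static int editDistance(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++)
      for (int j = 1; j <= b.length(); j++)
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
            d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
    return d[a.length()][b.length()];
  }
}
{code}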

Mike

On Thu, Jun 18, 2009 at 7:12 AM, Varun Dhussava...@mapmyindia.com wrote:
 Hi,

 I wrote on this a long time ago, but haven't followed it up. I just finished
 a C++ implementation of a spell check module in my software. I borrowed the
 idea from Xapian. It is to use a trigram index to filter results, and then
 use Edit Distance on the filtered set. Would such a solution be acceptable
 to the Lucene Community? The details of my implementation are as follows:

 1) QDBM data store hash map
 2) Trigram tokenizer on the input string
 3) Data store hash(key, value) = (trigram, keyword_id_list [kw1...kwN])
 4) Use trigram tokenizer and match with the trigram index
 5) Get the IDs within the input cutoff
 6) Run Edit Distance on the list and return

 In my tests on an Intel Core 2 Duo with 3 GB RAM and Windows XP 32-bit, it
 runs in 0.5 sec with a keyword record count of about 1,000,000 records.
 This is at least 3-4 times faster than the current search times with Lucene.

 Since the results can be put in a thread safe hash table structure, the
 trigram search can be distributed over a thread pool also.

 Does this seem like a workable suggestion to the community?

 Regards

 --
 Varun Dhussa
 Product Architect
 CE InfoSystems (P) Ltd
 http://www.mapmyindia.com


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-06-18 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721281#action_12721281
 ] 

Grant Ingersoll commented on LUCENE-1693:
-

{quote}
By the way, I tested Solr's token streams as well after updating the Lucene jar 
file. All tests pass (only some unrelated ones fail because of the latest changes 
in Lucene trunk, and some compile failures because of changes in non-released 
APIs).
Solr's TokenStreams are all programmed with the old API, but they get inverted 
using incrementToken from our patch.
The Solr query parser also seems to work. 
{quote}

Did you look at the performance on this?

 AttributeSource/TokenStream API improvements
 

 Key: LUCENE-1693
 URL: https://issues.apache.org/jira/browse/LUCENE-1693
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, 
 TestCompatibility.java


 This patch makes the following improvements to AttributeSource and
 TokenStream/Filter:
 - removes the set/getUseNewAPI() methods (including the standard
   ones). Instead by default incrementToken() throws a subclass of
   UnsupportedOperationException. The indexer tries to call
   incrementToken() initially once to see if the exception is thrown;
   if so, it falls back to the old API.
 - introduces interfaces for all Attributes. The corresponding
   implementations have the postfix 'Impl', e.g. TermAttribute and
   TermAttributeImpl. AttributeSource now has a factory for creating
   the Attribute instances; the default implementation looks for
   implementing classes with the postfix 'Impl'. Token now implements
   all 6 TokenAttribute interfaces.
 - new method added to AttributeSource:
   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
   class hierarchy of the passed in object and finds all interfaces
   that the class or superclasses implement and that extend the
   Attribute interface. It then adds the interface-instance mappings
   to the attribute map for each of the found interfaces.
 - AttributeImpl now has a default implementation of toString that uses
   reflection to print out the values of the attributes in a default
   formatting. This makes it a bit easier to implement AttributeImpl,
   because toString() was declared abstract before.
 - Cloning is now done much more efficiently in
   captureState. The method figures out which unique AttributeImpl
   instances are contained as values in the attributes map, because
   those are the ones that need to be cloned. It creates a single
   linked list that supports deep cloning (in the inner class
   AttributeSource.State). AttributeSource keeps track of when this
   state changes, i.e. whenever new attributes are added to the
   AttributeSource. Only in that case will captureState recompute the
   state, otherwise it will simply clone the precomputed state and
   return the clone. restoreState(AttributeSource.State) walks the
   linked list and uses the copyTo() method of AttributeImpl to copy
   all values over into the attribute that the source stream
   (e.g. SinkTokenizer) uses. 
 The cloning performance can be greatly improved if not multiple
 AttributeImpl instances are used in one TokenStream. A user can
 e.g. simply add a Token instance to the stream instead of the individual
 attributes. Or the user could implement a subclass of AttributeImpl that
 implements exactly the Attribute interfaces needed. I think this
 should be considered an expert API (addAttributeImpl), as this manual
 optimization is only needed if cloning performance is crucial. I ran
 some quick performance tests using Tee/Sink tokenizers (which do
 cloning) and the performance was roughly 20% faster with the new
 API. I'll run some more performance tests and post more numbers then.
 Note also that when we add serialization to the Attributes, e.g. for
 supporting storing serialized TokenStreams in the index, then the
 serialization should benefit even significantly more from the new API
 than cloning. 
 Also, the TokenStream API does not change, except for the removal 
 of the set/getUseNewAPI methods. So the patches in LUCENE-1460
 should still work.
 All core tests pass, however, I need to update all the documentation
 and also add some unit tests for the new AttributeSource
 functionality. So this patch is not ready to commit yet, but I wanted
 to post it already for some feedback. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-06-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721285#action_12721285
 ] 

Shai Erera commented on LUCENE-1630:


bq. CustomScorer's nextDoc uses advance on its subScorers.

Yeah, I noticed that, but I thought that out-of-order usually means a top scorer, 
and that then score(Collector) is called. But now I see CustomScorer does not 
implement score(Collector), which means it uses Scorer's, which calls nextDoc() 
and advance().

Regarding TopScorer, it'd need to get a Scorer as input, otherwise what would 
be its default impl for score(Collector)? I thought it should be the current 
one of Scorer's.

Will post a patch soon.

 Mating Collector and Scorer on doc Id orderness
 ---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch, 
 LUCENE-1630.patch


 This is a spin off of LUCENE-1593. This issue proposes to expose appropriate 
 API on Scorer and Collector such that one can create an optimized Collector 
 based on a given Scorer's doc-id orderness and vice versa. Copied from 
 LUCENE-1593, here is the list of changes:
 # Deprecate Weight and create QueryWeight (abstract class) with a new 
 scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) 
 method. QueryWeight implements Weight, while score(reader) calls 
 score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) 
 is defined abstract.
 #* Also add QueryWeightWrapper to wrap a given Weight implementation. This 
 one will also be deprecated, as well as package-private.
 #* Add to Query variants of createWeight and weight which return QueryWeight. 
 For now, I prefer to add a default impl which wraps the Weight variant 
 instead of overriding in all Query extensions, and in 3.0 when we remove the 
 Weight variants - override in all extending classes.
 # Add to Scorer isOutOfOrder with a default to false, and override in BS to 
 true.
 # Modify BooleanWeight to extend QueryWeight and implement the new scorer 
 method to return BS2 or BS based on the number of required scorers and 
 setAllowOutOfOrder.
 # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns 
 true/false.
 #* Use it in IndexSearcher.search methods, that accept a Collector, in order 
 to create the appropriate Scorer, using the new QueryWeight.
 #* Provide a static create method to TFC and TSDC which accept this as an 
 argument and creates the proper instance.
 #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order 
 Scorer and check on the resulting Scorer isOutOfOrder(), so that we can 
 create the optimized Collector instance.
 # Modify IndexSearcher to use all of the above logic.
 The only class I'm worried about, and would like to verify with you, is 
 Searchable. If we want to deprecate all the search methods on IndexSearcher, 
 Searcher and Searchable which accept Weight and add new ones which accept 
 QueryWeight, we must do the following:
 * Deprecate Searchable in favor of Searcher.
 * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) 
 break back-compat and add them as abstract (like we've done with the new 
 Collector method) or (2) add them with a default impl to call the Weight 
 versions, documenting these will become abstract in 3.0.
 * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend 
 Searcher. That's the part I'm a little bit worried about - Searchable 
 implements java.rmi.Remote, which means there could be an implementation out 
 there which implements Searchable and extends something different than 
 UnicastRemoteObject, like Activeable. I think there is very small chance this 
 has actually happened, but would like to confirm with you guys first.
 * Add a deprecated, package-private, SearchableWrapper which extends Searcher 
 and delegates all calls to the Searchable member.
 * Deprecate all uses of Searchable and add Searcher instead, defaulting the 
 old ones to use SearchableWrapper.
 * Make all the necessary changes to IndexSearcher, MultiSearcher etc. 
 regarding overriding these new methods.
 One other optimization that was discussed in LUCENE-1593 is to expose a 
 topScorer() API (on Weight) which returns a Scorer that its score(Collector) 
 will be called, and additionally add a start() method to DISI. That will 
 allow Scorers to initialize either on start() or score(Collector). This was 
 proposed mainly because of BS and BS2 which check if they are initialized in 
 every call to next(), skipTo() and score(). Personally I prefer to see that 
 

[jira] Updated: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-06-18 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1630:
---

Attachment: LUCENE-1630.patch

Implemented the latest comments, except for TopScorer.

 Mating Collector and Scorer on doc Id orderness
 ---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch, 
 LUCENE-1630.patch, LUCENE-1630.patch


 This is a spin off of LUCENE-1593. This issue proposes to expose appropriate 
 API on Scorer and Collector such that one can create an optimized Collector 
 based on a given Scorer's doc-id orderness and vice versa. Copied from 
 LUCENE-1593, here is the list of changes:
 # Deprecate Weight and create QueryWeight (abstract class) with a new 
 scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) 
 method. QueryWeight implements Weight, while score(reader) calls 
 score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) 
 is defined abstract.
 #* Also add QueryWeightWrapper to wrap a given Weight implementation. This 
 one will also be deprecated, as well as package-private.
 #* Add to Query variants of createWeight and weight which return QueryWeight. 
 For now, I prefer to add a default impl which wraps the Weight variant 
 instead of overriding in all Query extensions, and in 3.0 when we remove the 
 Weight variants - override in all extending classes.
 # Add to Scorer isOutOfOrder with a default to false, and override in BS to 
 true.
 # Modify BooleanWeight to extend QueryWeight and implement the new scorer 
 method to return BS2 or BS based on the number of required scorers and 
 setAllowOutOfOrder.
 # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns 
 true/false.
 #* Use it in IndexSearcher.search methods, that accept a Collector, in order 
 to create the appropriate Scorer, using the new QueryWeight.
 #* Provide a static create method to TFC and TSDC which accept this as an 
 argument and creates the proper instance.
 #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order 
 Scorer and check on the resulting Scorer isOutOfOrder(), so that we can 
 create the optimized Collector instance.
 # Modify IndexSearcher to use all of the above logic.
 The only class I'm worried about, and would like to verify with you, is 
 Searchable. If we want to deprecate all the search methods on IndexSearcher, 
 Searcher and Searchable which accept Weight and add new ones which accept 
 QueryWeight, we must do the following:
 * Deprecate Searchable in favor of Searcher.
 * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) 
 break back-compat and add them as abstract (like we've done with the new 
 Collector method) or (2) add them with a default impl to call the Weight 
 versions, documenting these will become abstract in 3.0.
 * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend 
 Searcher. That's the part I'm a little bit worried about - Searchable 
 implements java.rmi.Remote, which means there could be an implementation out 
 there which implements Searchable and extends something different than 
 UnicastRemoteObject, like Activeable. I think there is very small chance this 
 has actually happened, but would like to confirm with you guys first.
 * Add a deprecated, package-private, SearchableWrapper which extends Searcher 
 and delegates all calls to the Searchable member.
 * Deprecate all uses of Searchable and add Searcher instead, defaulting the 
 old ones to use SearchableWrapper.
 * Make all the necessary changes to IndexSearcher, MultiSearcher etc. 
 regarding overriding these new methods.
 One other optimization that was discussed in LUCENE-1593 is to expose a 
 topScorer() API (on Weight) which returns a Scorer that its score(Collector) 
 will be called, and additionally add a start() method to DISI. That will 
 allow Scorers to initialize either on start() or score(Collector). This was 
 proposed mainly because of BS and BS2 which check if they are initialized in 
 every call to next(), skipTo() and score(). Personally I prefer to see that 
 in a separate issue, following that one (as it might add methods to 
 QueryWeight).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-06-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721311#action_12721311
 ] 

Mark Miller commented on LUCENE-1595:
-

bq. I added readContentSource.alg just for that purpose and ran it over the 
Wikipedia dump. All documents were read successfully.

I figured you probably had, but they won't end up coming after you, they will 
come after me :) As expected, no issues hit yet though.


I'll commit this later today.

 Split DocMaker into ContentSource and DocMaker
 --

 Key: LUCENE-1595
 URL: https://issues.apache.org/jira/browse/LUCENE-1595
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch, 
 LUCENE-1595.patch, LUCENE-1595.patch


 This issue proposes some refactoring to the benchmark package. Today, 
 DocMaker has two roles: collecting documents from a collection and preparing 
 a Document object. These two should actually be split up to ContentSource and 
 DocMaker, which will use a ContentSource instance.
 ContentSource will implement all the methods of DocMaker, like 
 getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 
 1591, by having a basic ContentSource that offers input stream services, and 
 wraps a file (for example) with a bzip or gzip streams etc.
 DocMaker will implement the makeDocument methods, reusing DocState etc.
 The idea is that collecting the Enwiki documents, for example, should be the 
 same whether I create documents using DocState, add payloads or index 
 additional metadata. Same goes for Trec and Reuters collections, as well as 
 LineDocMaker.
 In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 
 99% the same and 99% different. Most of their differences lie in the way they 
 read the data, while most of the similarity lies in the way they create 
 documents (using DocState).
 That led to a somehwat bizzare extension of LineDocMaker by EnwikiDocMaker 
 (just the reuse of DocState). Also, other DocMakers do not use that DocState 
 today, something they could have gotten for free with this refactoring 
 proposed.
 So by having a EnwikiContentSource, ReutersContentSource and others (TREC, 
 Line, Simple), I can write several DocMakers, such as DocStateMaker, 
 ConfigurableDocMaker (one which accpets all kinds of config options) and 
 custom DocMakers (payload, facets, sorting), passing to them a ContentSource 
 instance and reuse the same DocMaking algorithm with many content sources, as 
 well as the same ContentSource algorithm with many DocMaker implementations.
 This will also give us the opportunity to perf test content sources alone 
 (i.e., compare bzip, gzip and regular input streams), w/o the overhead of 
 creating a Document object.
 I've already done so in my code environment (I extend the benchmark package 
 for my application's purposes) and I like the flexibility I have. I think 
 this can be a nice contribution to the benchmark package, which can result in 
 some code cleanup as well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-06-18 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721320#action_12721320
 ] 

Shai Erera commented on LUCENE-1595:


bq. they won't end up coming after you, they will come after me :)

I promise to cover for you if that happens :)

bq. I'll commit this later today.

Thanks !

 Split DocMaker into ContentSource and DocMaker
 --

 Key: LUCENE-1595
 URL: https://issues.apache.org/jira/browse/LUCENE-1595
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch, 
 LUCENE-1595.patch, LUCENE-1595.patch


 This issue proposes some refactoring to the benchmark package. Today, 
 DocMaker has two roles: collecting documents from a collection and preparing 
 a Document object. These two should actually be split up to ContentSource and 
 DocMaker, which will use a ContentSource instance.
 ContentSource will implement all the methods of DocMaker, like 
 getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 
 1591, by having a basic ContentSource that offers input stream services, and 
 wraps a file (for example) with a bzip or gzip streams etc.
 DocMaker will implement the makeDocument methods, reusing DocState etc.
 The idea is that collecting the Enwiki documents, for example, should be the 
 same whether I create documents using DocState, add payloads or index 
 additional metadata. Same goes for Trec and Reuters collections, as well as 
 LineDocMaker.
 In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 
 99% the same and 99% different. Most of their differences lie in the way they 
 read the data, while most of the similarity lies in the way they create 
 documents (using DocState).
 That led to a somehwat bizzare extension of LineDocMaker by EnwikiDocMaker 
 (just the reuse of DocState). Also, other DocMakers do not use that DocState 
 today, something they could have gotten for free with this refactoring 
 proposed.
 So by having a EnwikiContentSource, ReutersContentSource and others (TREC, 
 Line, Simple), I can write several DocMakers, such as DocStateMaker, 
 ConfigurableDocMaker (one which accpets all kinds of config options) and 
 custom DocMakers (payload, facets, sorting), passing to them a ContentSource 
 instance and reuse the same DocMaking algorithm with many content sources, as 
 well as the same ContentSource algorithm with many DocMaker implementations.
 This will also give us the opportunity to perf test content sources alone 
 (i.e., compare bzip, gzip and regular input streams), w/o the overhead of 
 creating a Document object.
 I've already done so in my code environment (I extend the benchmark package 
 for my application's purposes) and I like the flexibility I have. I think 
 this can be a nice contribution to the benchmark package, which can result in 
 some code cleanup as well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1673) Move TrieRange to core

2009-06-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721326#action_12721326
 ] 

Michael McCandless commented on LUCENE-1673:


Latest patch looks good Uwe!  We can separately tweak the javadocs...

 Move TrieRange to core
 --

 Key: LUCENE-1673
 URL: https://issues.apache.org/jira/browse/LUCENE-1673
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1673.patch, LUCENE-1673.patch, LUCENE-1673.patch, 
 LUCENE-1673.patch, LUCENE-1673.patch


 TrieRange was iterated many times and seems stable now (LUCENE-1470, 
 LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to 
 its default FieldTypes (SOLR-940) and if possible I want to move it to core 
 before release of 2.9.
 Before this can be done, there are some things to think about:
 # There are now classes called LongTrieRangeQuery, IntTrieRangeQuery, how 
 should they be called in core? I would suggest to leave it as it is. On the 
 other hand, if this keeps our only numeric query implementation, we could 
 call it LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below, here 
 are problems). Same for the TokenStreams and Filters.
 # Maybe the pairs of classes for indexing and searching should be moved into 
 one class: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The 
 problem here: ctors must be able to pass int, long, double, float as range 
 parameters. For the end user, mixing these 4 types in one class is hard to 
 handle. If somebody forgets to add a L to a long, it suddenly instantiates a 
 int version of range query, hitting no results and so on. Same with other 
 types. Maybe accept java.lang.Number as parameter (because nullable for 
 half-open bounds) and one enum for the type.
 # TrieUtils move into o.a.l.util? or document or?
 # Move TokenStreams into o.a.l.analysis, ShiftAttribute into 
 o.a.l.analysis.tokenattributes? Somewhere else?
 # If we rename the classes, should Solr stay with Trie (because there are 
 different impls)?
 # Maybe add a subclass of AbstractField, that automatically creates these 
 TokenStreams and omits norms/tf per default for easier addition to Document 
 instances?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1700) LogMergePolicy.findMergesToExpungeDeletes need to get deletes from the SegmentReader

2009-06-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1700:
---

Attachment: LUCENE-1700.patch

Attached patch.

I added a test case showing it, then took that same approach (from LUCENE-1313) 
and the test passes.

I also found that with NRT, because the deletions are applied before
building the CFS after flushing, we wind up holding open both the
non-CFS and CFS files on creating the reader.  So, I changed deletions
to flush after the CFS is built.

I plan to commit in a day or two.


 LogMergePolicy.findMergesToExpungeDeletes need to get deletes from the 
 SegmentReader
 

 Key: LUCENE-1700
 URL: https://issues.apache.org/jira/browse/LUCENE-1700
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1700.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 With LUCENE-1516, deletes are carried over in the SegmentReaders
 which means implementations of
 MergePolicy.findMergesToExpungeDeletes (such as LogMergePolicy)
 need to obtain deletion info from the SR (instead of from the
 SegmentInfo which won't have the information).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations

2009-06-18 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721410#action_12721410
 ] 

Hoss Man commented on LUCENE-1677:
--

{quote}
I did ask:

http://www.mail-archive.com/java-u...@lucene.apache.org/msg26726.html

And nobody answered.

So I think we should remove it, and the org.apache.lucene.SegmentReader.class 
system property? Can you post a patch? Thanks.
{quote}

FWIW: Google code search pops up a few uses in publicly available code...
http://www.google.co.uk/codesearch?hl=enlr=q=org.apache.lucene.SegmentReader.class+-package%3Arepos%2Fasf%2Flucene%2Fjavasbtn=Search

What jumps out at me is that apparently older versions of Compass relied on 
this feature ... it looks like Compass 2.0 eliminated the need for this class, 
but I just wanted to point this out.




 Remove GCJ IndexReader specializations
 --

 Key: LUCENE-1677
 URL: https://issues.apache.org/jira/browse/LUCENE-1677
 Project: Lucene - Java
  Issue Type: Task
Reporter: Earwin Burrfoot
Assignee: Michael McCandless
 Fix For: 2.9


 These specializations are outdated, unsupported, most probably pointless due 
 to the speed of modern JVMs and, I bet, nobody uses them (Mike, you said you 
 are going to ask people on java-user, anybody replied that they need it?). 
 While giving nothing, they make SegmentReader instantiation code look real 
 ugly.
 If nobody objects, I'm going to post a patch that removes these from Lucene.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1692) Contrib analyzers need tests

2009-06-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721418#action_12721418
 ] 

Robert Muir commented on LUCENE-1692:
-

Michael: I think I'm done here.

If you consider any of the bugs important, just let me know; I can try to help get 
them fixed.


 Contrib analyzers need tests
 

 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt, 
 LUCENE-1692.txt


 The analyzers in contrib need tests, preferably ones that test the behavior 
 of all the Token 'attributes' involved (offsets, type, etc) and not just what 
 they do with token text.
 This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Lucene 2.9 Again

2009-06-18 Thread Jason Rutherglen
 I pretty much find any excuse to go and write stuff in Python

There's Scala...

On Thu, Jun 18, 2009 at 2:37 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Wed, Jun 17, 2009 at 4:13 PM, Mark Millermarkrmil...@gmail.com wrote:
  Michael Busch wrote:
 
  Everyone who is unhappy with the release TODO's, go back in your mail
  archive to the 2.2 release and check how many tedious little changes we
 made
  to improve the release quality. And besides the maven stuff, there is
 not
  really more to do compared to pre-2.2, it's just documented in a more
  verbose (=RM-friendly) way.
 
  I didn't mean to imply anything untowards :) I'm grateful for the work
 you
  guys have put into making it all more friendly. I know I have seen many
 of
  Mike M's wiki updates on this page too, and I've always been sure its for
  the better.

 Well, I made lots of silly mistakes during my releases :)  (if you're
 not making mistakes, you're not trying hard enough)

 So every time I made a mistake I went and updated it.

  Even still, when I look at the process, I remember why I clung to Windows
  for so long :) Now I'm happily on Ubuntu and can still usually avoid such
  fun work :)

 The next step after Ubuntu is OS X, of course ;)

  I'll happily soldier on though. I just wish it was all in Java :)

 I pretty much find any excuse to go and write stuff in Python ;)  So,
  I wrote a Python script that goes and signs/verifies sigs on all the
 Maven artifacts.

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-06-18 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721426#action_12721426
 ] 

Uwe Schindler commented on LUCENE-1693:
---

I only tested performance with the Lucene benchmarker on the various standard 
analyzers. After the patch, tokenizer.alg produces the same results as before in 
almost the same time (the time variations are bigger than the differences). With 
an unmodified benchmarker this is expected: the benchmarker's tokenizer task 
still calls the deprecated next(Token), and since all core analyzers still 
implement it directly, there is no wrapping. I then modified the core token 
streams and filters that were used, removed next(Token) and left only 
incrementToken() available; in this case the speed difference was also not 
measurable in my configuration (Thinkpad T60, Core Duo, Win32). I also changed 
some of the filters to implement only next(Token) and others to implement only 
incrementToken(), to get a completely mixed old/new API chain, and still got the 
same results (and the same tokenization results, as seen in the indexes generated 
from Wikipedia). Changing the benchmarker itself to use incrementToken() was 
also fine.

To get a small speed increase (though I was not able to measure it), I changed 
all tokenizers to use only incrementToken() for the whole chain and changed the 
benchmarker to use that method as well. In this case I could call 
TokenStream.setOnlyUseNewAPI(true), which removes the backwards-compatibility 
wrapper and the Token instance, so the chain only uses the unwrapped simple 
attributes. In my opinion tokenization was then a little bit faster, faster than 
without the patch using next(Token). When the old API is completely removed, 
this will be the default behaviour.

So I would suggest reviewing this patch, adding some tests for heterogeneous 
tokenizer chains, removing all next(...) implementations from the streams and 
filters, and implementing only incrementToken(). Contrib analyzers should then 
be rewritten to the new API only, without the old API.
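
For illustration, this is the style of stream the last paragraph is talking about: 
a filter written purely against the new API. A sketch only, not code from this patch:

{code}
// illustrative only: implements incrementToken() plus attributes, no next(Token)
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class LowerCaseOnlyNewApiFilter extends TokenFilter {
  private final TermAttribute termAtt;

  public LowerCaseOnlyNewApiFilter(TokenStream input) {
    super(input);
    termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;                       // end of stream
    }
    final char[] buffer = termAtt.termBuffer();
    final int length = termAtt.termLength();
    for (int i = 0; i < length; i++) {
      buffer[i] = Character.toLowerCase(buffer[i]);
    }
    return true;
  }
}
{code}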

The mentioned bugs with Tee/Sink are not related to this issue, but they are more 
serious now, because the tokenizer chain is no longer fixed to one specific API 
variant (it supports both mixed together).


 AttributeSource/TokenStream API improvements
 

 Key: LUCENE-1693
 URL: https://issues.apache.org/jira/browse/LUCENE-1693
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, 
 TestCompatibility.java


 This patch makes the following improvements to AttributeSource and
 TokenStream/Filter:
 - removes the set/getUseNewAPI() methods (including the standard
   ones). Instead by default incrementToken() throws a subclass of
   UnsupportedOperationException. The indexer tries to call
   incrementToken() initially once to see if the exception is thrown;
   if so, it falls back to the old API.
 - introduces interfaces for all Attributes. The corresponding
   implementations have the postfix 'Impl', e.g. TermAttribute and
   TermAttributeImpl. AttributeSource now has a factory for creating
   the Attribute instances; the default implementation looks for
   implementing classes with the postfix 'Impl'. Token now implements
   all 6 TokenAttribute interfaces.
 - new method added to AttributeSource:
   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
   class hierarchy of the passed in object and finds all interfaces
   that the class or superclasses implement and that extend the
   Attribute interface. It then adds the interface-instance mappings
   to the attribute map for each of the found interfaces.
 - AttributeImpl now has a default implementation of toString that uses
   reflection to print out the values of the attributes in a default
   formatting. This makes it a bit easier to implement AttributeImpl,
   because toString() was declared abstract before.
 - Cloning is now done much more efficiently in
   captureState. The method figures out which unique AttributeImpl
   instances are contained as values in the attributes map, because
   those are the ones that need to be cloned. It creates a single
   linked list that supports deep cloning (in the inner class
   AttributeSource.State). AttributeSource keeps track of when this
   state changes, i.e. whenever new attributes are added to the
   AttributeSource. Only in that case will captureState recompute the
   state, otherwise it will simply clone the precomputed state and
   return the clone. restoreState(AttributeSource.State) walks the
   linked list and uses the 

[jira] Commented: (LUCENE-1646) QueryParser throws new exceptions even if custom parsing logic threw a better one

2009-06-18 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721427#action_12721427
 ] 

Hoss Man commented on LUCENE-1646:
--

As a general rule, code catching an exception and throwing a new exception with 
more details should (almost always) call initCause (unless the new Exception 
has a constructor that takes care of that part) to preserve all of the stack 
history.

Client code that wants to get at the root exception can do so using getCause().

In QueryParser...
{code}
} catch (ParseException tme) {
   // rethrow to include the original query:
   ParseException e = new ParseException("Cannot parse '" + query + "': " + tme.getMessage());
   e.initCause(tme);
   throw e;
}
{code}

In Trejkaz's code, something like...
{code}
} catch (ParseException pexp) {
   for (Throwable t = pexp; null != t; t = t.getCause()) {
      if (t instanceof OurCustomException) {
         takeActionOnCustomException((OurCustomException) t);
      }
      takeActionOnLuceneQueryParserException(pexp);
   }
}
{code}

 QueryParser throws new exceptions even if custom parsing logic threw a better 
 one
 -

 Key: LUCENE-1646
 URL: https://issues.apache.org/jira/browse/LUCENE-1646
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4.1
Reporter: Trejkaz

 We have subclassed QueryParser and have various custom fields.  When these 
 fields contain invalid values, we throw a subclass of ParseException which 
 has a more useful message (and also a localised message.)
 Problem is, Lucene's QueryParser is doing this:
 {code}
 catch (ParseException tme) {
 // rethrow to include the original query:
 throw new ParseException("Cannot parse '" + query + "': " + tme.getMessage());
 }
 {code}
 Thus, our nice and useful ParseException is thrown away, replaced by one with 
 no information about what's actually wrong with the query (it does append 
 getMessage() but that isn't localised.  And it also throws away the 
 underlying cause for the exception.)
 I am about to patch our copy to simply remove these four lines; the caller 
 knows what the query string was (they have to have a copy of it because they 
 are passing it in!) so having it in the error message itself is not useful.  
 Furthermore, when the query string is very big, what the user wants to know 
 is not that the whole query was bad, but which part of it was bad.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-06-18 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller resolved LUCENE-1595.
-

   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [New])

Thanks Shai, I just committed this.

 Split DocMaker into ContentSource and DocMaker
 --

 Key: LUCENE-1595
 URL: https://issues.apache.org/jira/browse/LUCENE-1595
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch, 
 LUCENE-1595.patch, LUCENE-1595.patch


 This issue proposes some refactoring to the benchmark package. Today, 
 DocMaker has two roles: collecting documents from a collection and preparing 
 a Document object. These two should actually be split up to ContentSource and 
 DocMaker, which will use a ContentSource instance.
 ContentSource will implement all the methods of DocMaker, like 
 getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 
 1591, by having a basic ContentSource that offers input stream services, and 
 wraps a file (for example) with a bzip or gzip streams etc.
 DocMaker will implement the makeDocument methods, reusing DocState etc.
 The idea is that collecting the Enwiki documents, for example, should be the 
 same whether I create documents using DocState, add payloads or index 
 additional metadata. Same goes for Trec and Reuters collections, as well as 
 LineDocMaker.
 In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 
 99% the same and 99% different. Most of their differences lie in the way they 
 read the data, while most of the similarity lies in the way they create 
 documents (using DocState).
 That led to a somehwat bizzare extension of LineDocMaker by EnwikiDocMaker 
 (just the reuse of DocState). Also, other DocMakers do not use that DocState 
 today, something they could have gotten for free with this refactoring 
 proposed.
 So by having a EnwikiContentSource, ReutersContentSource and others (TREC, 
 Line, Simple), I can write several DocMakers, such as DocStateMaker, 
 ConfigurableDocMaker (one which accpets all kinds of config options) and 
 custom DocMakers (payload, facets, sorting), passing to them a ContentSource 
 instance and reuse the same DocMaking algorithm with many content sources, as 
 well as the same ContentSource algorithm with many DocMaker implementations.
 This will also give us the opportunity to perf test content sources alone 
 (i.e., compare bzip, gzip and regular input streams), w/o the overhead of 
 creating a Document object.
 I've already done so in my code environment (I extend the benchmark package 
 for my application's purposes) and I like the flexibility I have. I think 
 this can be a nice contribution to the benchmark package, which can result in 
 some code cleanup as well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Lucene 2.9 Again

2009-06-18 Thread Michael McCandless
On Thu, Jun 18, 2009 at 3:07 PM, Jason
Rutherglenjason.rutherg...@gmail.com wrote:
 I pretty much find any excuse to go and write stuff in Python

 There's Scala...

I've only read about it so far but it does look good.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1692) Contrib analyzers need tests

2009-06-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721451#action_12721451
 ] 

Michael McCandless commented on LUCENE-1692:


bq. Michael: I think I'm done here.

OK I'll review.  Thanks!!

bq. if you consider any of the bugs important just let me know, can try to help 
get them fixed.

Likely I won't be able to judge the severity of these bugs... so please chime 
in if you think they should be fixed...

 Contrib analyzers need tests
 

 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt, 
 LUCENE-1692.txt


 The analyzers in contrib need tests, preferably ones that test the behavior 
 of all the Token 'attributes' involved (offsets, type, etc) and not just what 
 they do with token text.
 This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1692) Contrib analyzers need tests

2009-06-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721457#action_12721457
 ] 

Robert Muir commented on LUCENE-1692:
-

Michael, I think it would be nice to fix the Thai offset bug so the highlighter 
will work; it is a safe one-line fix and an obvious error.

The SmartChineseAnalyzer empty-token bug is pretty serious: I think indexing 
empty tokens for every piece of punctuation could really hurt similarity 
computation (am I wrong? I have never tried).

The Thai .type() bug is something that could be fixed later; I don't think the 
token type being ALPHANUM versus NUM is really hurting anyone.

The issue where DutchAnalyzer doesn't do what it claims is, I think, also not 
really hurting anyone, and users can switch to the snowball version if they want 
accurate Snowball behavior. I do think the huge files in DutchAnalyzer that 
aren't being used could be removed if you want to save 1 MB, but I'm not sure 
how important that is.

Let me know your thoughts. 

 Contrib analyzers need tests
 

 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt, 
 LUCENE-1692.txt


 The analyzers in contrib need tests, preferably ones that test the behavior 
 of all the Token 'attributes' involved (offsets, type, etc) and not just what 
 they do with token text.
 This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1692) Contrib analyzers need tests

2009-06-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721460#action_12721460
 ] 

Michael McCandless commented on LUCENE-1692:


I'm seeing this test failure:
{code}
[junit] Testcase: 
testBuggyPunctuation(org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer):   
  Caused an ERROR
[junit] null
[junit] java.lang.AssertionError
[junit] at 
org.apache.lucene.analysis.StopFilter.next(StopFilter.java:240)
[junit] at 
org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer.testBuggyPunctuation(TestSmartChineseAnalyzer.java:51)
{code}

It's because null is being passed to ts.next in the final assertTrue line:

{code}
nt = ts.next(nt);
while (nt != null) {
  assertEquals(result[i], nt.term());
  i++;
  nt = ts.next(nt);
}
assertTrue(ts.next(nt) == null);
{code}
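
For what it's worth, the stream is already exhausted once next() returns null, so 
the final ts.next(nt) call is redundant. A small illustrative helper (hypothetical 
names, not from the actual test) that collects terms without ever passing null back 
into the stream:

{code}
// illustrative only; assumes the usual java.util, java.io.IOException and
// org.apache.lucene.analysis.{Token, TokenStream} imports
static List<String> collectTerms(TokenStream ts) throws IOException {
  List<String> terms = new ArrayList<String>();
  Token reusable = new Token();
  for (Token t = ts.next(reusable); t != null; t = ts.next(t)) {
    terms.add(t.term());
  }
  // next() returning null already means the stream is exhausted, so no extra
  // ts.next(null) call is needed (that is what trips the assert in StopFilter)
  return terms;
}
{code}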

 Contrib analyzers need tests
 

 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt, 
 LUCENE-1692.txt


 The analyzers in contrib need tests, preferably ones that test the behavior 
 of all the Token 'attributes' involved (offsets, type, etc) and not just what 
 they do with token text.
 This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Some thoughts around the use of reader.isDeleted and hasDeletions

2009-06-18 Thread Shai Erera
I've made the changes to SegmentMerger and want to make the following
changes to IndexReader.document(): (1) don't call ensureOpen() and (2) don't
check isDeleted.

Question is - can I make these changes on the current impls, or do I need to
deprecate and come up w/ a new name? Here a new name is not a big challenge
- we can choose: doc() or getDocument() for example. I don't feel
rawDocument flows nicely (what's raw about it?)

IMO, even though these are back-compat changes (to runtime behavior), they are not
likely to affect anyone. I mean, why would someone deliberately call
document() when the reader has already been closed (unless they don't know
it at the time of calling document())? For easy migration (if you rely on
that feature), I can add isClose()/isOpen() w/ a default impl that calls
ensureOpen().

Or why call document(doc) if the doc is deleted? What's the scenario?

Anyway, those two changes are necessary because our merging code calls these
methods but already checks whether a doc is deleted beforehand. So it's just a
question of a new method vs. a runtime change.
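
Just to make the migration idea concrete, a rough sketch (names and placement
are only placeholders, nothing is decided):

  // purely illustrative - not part of any patch
  public boolean isOpen() {
    return getRefCount() > 0;   // same condition ensureOpen() checks today
  }

  // a caller that relied on document() throwing on a closed reader
  // would then check explicitly instead:
  if (reader.isOpen()) {
    Document storedFields = reader.document(docID);
  }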

What do you think?

Shai

On Wed, Jun 10, 2009 at 6:39 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Wed, Jun 10, 2009 at 11:16 AM, Shai Erera ser...@gmail.com wrote:
  it makes sense because isDeleted() is essentially the *only* thing
  being done in the loop, and hence we can eliminate the loop entirely
 
  You mean that in case there is a matching segment, we can call
  matchingVectorsReader.rawDocs(rawDocLengths, rawDocLengths2, 0, maxDoc)?

 Right... or rather directly calculate numDocs and docNum instead of
 using the loop.

  But in case it does not have a matching segment, we'd still need to iterate
  on the docs, and copy the term vectors one by one, right?

 Right, and that's the case where I think duplicating the code to
 remove a single branch-predictable boolean flag isn't warranted as it
 won't result in a measurable performance increase.

 -Yonik
 http://www.lucidimagination.com

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




[jira] Commented: (LUCENE-1692) Contrib analyzers need tests

2009-06-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721461#action_12721461
 ] 

Mark Miller commented on LUCENE-1692:
-

heh -

+1 on fixing them all. Including reclaiming that 1 mb of space if we can ...

 Contrib analyzers need tests
 

 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt, 
 LUCENE-1692.txt


 The analyzers in contrib need tests, preferably ones that test the behavior 
 of all the Token 'attributes' involved (offsets, type, etc) and not just what 
 they do with token text.
 This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1692) Contrib analyzers need tests

2009-06-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721462#action_12721462
 ] 

Michael McCandless commented on LUCENE-1692:


Me too :)  Robert can you cons up a patch?  Which files can be safely removed 
from the DutchAnalyzer?  (stems/words.txt?)

 Contrib analyzers need tests
 

 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt, 
 LUCENE-1692.txt


 The analyzers in contrib need tests, preferably ones that test the behavior 
 of all the Token 'attributes' involved (offsets, type, etc) and not just what 
 they do with token text.
 This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1692) Contrib analyzers need tests

2009-06-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721463#action_12721463
 ] 

Robert Muir commented on LUCENE-1692:
-

michael, i guess junit from my eclipse != junit from ant, because it passes in 
eclipse...annoying

I will fix the test so it runs correctly from ant.

 Contrib analyzers need tests
 

 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt, 
 LUCENE-1692.txt


 The analyzers in contrib need tests, preferably ones that test the behavior 
 of all the Token 'attributes' involved (offsets, type, etc) and not just what 
 they do with token text.
 This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1692) Contrib analyzers need tests

2009-06-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721469#action_12721469
 ] 

Michael McCandless commented on LUCENE-1692:


Probably eclipse isn't running with asserts?

 Contrib analyzers need tests
 

 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt, 
 LUCENE-1692.txt


 The analyzers in contrib need tests, preferably ones that test the behavior 
 of all the Token 'attributes' involved (offsets, type, etc) and not just what 
 they do with token text.
 This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Fuzzy search change

2009-06-18 Thread eks dev

what would be the difference/benefit compared to standard lucene SpellChecker? 

If I am not wrong:
- Lucene SpellChecker uses a standard Lucene index as storage for the tokens 
instead of QDBM... meaning a full inverted index with arbitrary N-gram lengths, 
with tf/idf/norms... not only a HashMap<trigram, wordList> 

- SC uses the paradigm "give me the N best candidates" (by similarity), not only 
all above a cutoff... this similarity depends (standard Lucene Similarity) on N-gram 
frequency (one could even use some sexy norms to fine-tune words...)...  

If I've read your proposal correctly and did not miss something important, my 
suggestion would be to have a look at Lucene SC 
(http://lucene.apache.org/java/2_3_2/api/contrib-spellchecker/org/apache/lucene/search/spell/SpellChecker.html)
 before you start.
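
For example, the usual SC usage is roughly this (from memory, so double-check
the signatures against the javadocs above; the path, the field name and
indexReader are just placeholders):

  Directory spellDir = FSDirectory.getDirectory("/path/to/spellindex");
  SpellChecker spellChecker = new SpellChecker(spellDir);
  // build the n-gram index from an existing field of your main index
  spellChecker.indexDictionary(new LuceneDictionary(indexReader, "title"));
  // ask for the 5 best candidates instead of everything above a cutoff
  String[] suggestions = spellChecker.suggestSimilar("lucen", 5);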
 

have fun, 
eks



- Original Message 
 From: Michael McCandless luc...@mikemccandless.com
 To: java-dev@lucene.apache.org
 Sent: Thursday, 18 June, 2009 16:29:59
 Subject: Re: Fuzzy search change
 
 This would make an awesome addition to Lucene!
 
 This is similar to how Lucene's spellchecker identifies candidates, if
 I understand it right.
 
 Would you be able to port it to java?
 
 Mike
 
 On Thu, Jun 18, 2009 at 7:12 AM, Varun Dhussa wrote:
  Hi,
 
  I wrote on this a long time ago, but haven't followed it up. I just finished
  a C++ implementation of a spell check module in my software. I borrowed the
  idea from Xapian. It is to use a trigram index to filter results, and then
  use Edit Distance on the filtered set. Would such a solution be acceptable
  to the Lucene Community? The details of my implementation are as follows:
 
  1) QDBM data store hash map
  2) Trigram tokenizer on the input string
  3) Data store hash(key, value) = (trigram, keyword_id_list)
  4) Use trigram tokenizer and match with the trigram index
  5) Get the IDs within the input cutoff
  6) Run Edit Distance on the list and return
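
  (For illustration only: a very rough Java sketch of steps 2)-6). Every name
  here is invented; the actual implementation described above is C++ on QDBM.)

  import java.util.*;

  class TrigramSpellSketch {
    // 3) trigram -> list of keyword ids
    final Map<String, List<Integer>> index = new HashMap<String, List<Integer>>();
    final List<String> keywords = new ArrayList<String>();

    void add(String keyword) {
      int id = keywords.size();
      keywords.add(keyword);
      for (String gram : trigrams(keyword)) {        // build the trigram index
        List<Integer> ids = index.get(gram);
        if (ids == null) index.put(gram, ids = new ArrayList<Integer>());
        ids.add(id);
      }
    }

    // 2) trigram tokenizer on a string
    static List<String> trigrams(String s) {
      List<String> grams = new ArrayList<String>();
      for (int i = 0; i + 3 <= s.length(); i++) grams.add(s.substring(i, i + 3));
      return grams;
    }

    // 4)+5) count shared trigrams per keyword id and apply the cutoff,
    // 6) then run edit distance only on the surviving candidates
    List<String> suggest(String input, int minSharedTrigrams, int maxEditDistance) {
      Map<Integer, Integer> shared = new HashMap<Integer, Integer>();
      for (String gram : trigrams(input)) {
        List<Integer> ids = index.get(gram);
        if (ids == null) continue;
        for (int id : ids) {
          Integer c = shared.get(id);
          shared.put(id, c == null ? 1 : c + 1);
        }
      }
      List<String> result = new ArrayList<String>();
      for (Map.Entry<Integer, Integer> e : shared.entrySet()) {
        if (e.getValue() < minSharedTrigrams) continue;
        String candidate = keywords.get(e.getKey());
        if (editDistance(input, candidate) <= maxEditDistance) result.add(candidate);
      }
      return result;
    }

    // plain Levenshtein distance
    static int editDistance(String a, String b) {
      int[][] d = new int[a.length() + 1][b.length() + 1];
      for (int i = 0; i <= a.length(); i++) d[i][0] = i;
      for (int j = 0; j <= b.length(); j++) d[0][j] = j;
      for (int i = 1; i <= a.length(); i++)
        for (int j = 1; j <= b.length(); j++)
          d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
              d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
      return d[a.length()][b.length()];
    }
  }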
 
  In my tests on an Intel Core 2 Duo with 3 GB RAM and Windows XP 32 bit, it
  runs in 0.5 sec with a keyword record count of about 1,000,000 records.
  This is at least 3-4 times less than the current search times on Lucene.
 
  Since the results can be put in a thread safe hash table structure, the
  trigram search can be distributed over a thread pool also.
 
  Does this seem like a workable suggestion to the community?
 
  Regards
 
  --
  Varun Dhussa
  Product Architect
  CE InfoSystems (P) Ltd
  http://www.mapmyindia.com
 
 
  -
  To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-dev-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1692) Contrib analyzers need tests

2009-06-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721475#action_12721475
 ] 

Robert Muir commented on LUCENE-1692:
-

probably, fixed it and testing with ant now. i'll upload it at least so you can 
verify the behavior i've discovered.

do you want me to include a patch with the two bugfixes (chinese empty token and 
thai offsets), or give you something separate for those?

for the other 2 bugs:
fixing the Thai tokentype bug: well, it's really a bug in the standardtokenizer 
grammar. i wasn't sure you wanted to change that at this moment, but if you 
want it fixed let me know!
in my opinion the fix for DutchAnalyzer is to deprecate/remove the contrib 
completely: since it claims to do snowball stemming, why shouldn't someone just 
use the Dutch snowball stemmer from the contrib/snowball package?

  


 Contrib analyzers need tests
 

 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt, 
 LUCENE-1692.txt


 The analyzers in contrib need tests, preferably ones that test the behavior 
 of all the Token 'attributes' involved (offsets, type, etc) and not just what 
 they do with token text.
 This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1692) Contrib analyzers need tests

2009-06-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721504#action_12721504
 ] 

Robert Muir commented on LUCENE-1692:
-

ok, got it:

the IDEOGRAPHIC FULL STOP is being converted into a comma token by the 
tokenizer.
if you use the default constructor, SmartChineseAnalyzer(), it won't load the 
default stopwords list - that's the case in my Luke screenshot.
if you instead instantiate it as SmartChineseAnalyzer(true), then it loads the 
default stopwords list.
the default stopwords list includes things like the comma, so that token ends up 
getting removed.

maybe it's not a bug, but this is really non-obvious behavior...!
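
for reference, the two constructors in question (behavior as of current trunk,
before any fix; the analyzer lives under org.apache.lucene.analysis.cn in
contrib/analyzers):

{code}
// no-arg constructor: no stopwords are loaded, so the comma-like token
// produced for the IDEOGRAPHIC FULL STOP survives (what the Luke screenshot shows)
Analyzer noStops = new SmartChineseAnalyzer();

// pass true to load the bundled default stopword list: the punctuation
// token is then removed by the StopFilter
Analyzer withStops = new SmartChineseAnalyzer(true);
{code}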


 Contrib analyzers need tests
 

 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: example.jpg, LUCENE-1692.txt, LUCENE-1692.txt, 
 LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt


 The analyzers in contrib need tests, preferably ones that test the behavior 
 of all the Token 'attributes' involved (offsets, type, etc) and not just what 
 they do with token text.
 This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1692) Contrib analyzers need tests

2009-06-18 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1692:


Attachment: LUCENE-1692.txt

patch with new testcase demonstrating the chinese behavior.

 Contrib analyzers need tests
 

 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: example.jpg, LUCENE-1692.txt, LUCENE-1692.txt, 
 LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt


 The analyzers in contrib need tests, preferably ones that test the behavior 
 of all the Token 'attributes' involved (offsets, type, etc) and not just what 
 they do with token text.
 This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Tests fail to compile on JDK 1.4?

2009-06-18 Thread Chris Hostetter

: We had some discussions about it; the easiest is to set the bootclasspath
: in the <javac/> task to an older rt.jar during compilation. Because this
: needs updates for e.g. Hudson (rt.jar missing), we said that whoever
: releases the final version should simply check this beforehand on the
: compilation machine during the release process.

there are ways to automate this sanity check in ant, i took a stab at 
this a while back...
  https://issues.apache.org/jira/browse/LUCENE-718

...but i never moved forward with it because most people didn't seem 
that concerned.



-Hoss


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1692) Contrib analyzers need tests

2009-06-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721512#action_12721512
 ] 

Robert Muir commented on LUCENE-1692:
-

later tonight i can work up a patch to address the thai offset issue and at least 
javadoc the chinese behavior.

if you think the addt'l 2 issues [thai tokentype, dutchanalyzer behavior/huge 
files] should be fixed or documented in some way, please let me know.


 Contrib analyzers need tests
 

 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: example.jpg, LUCENE-1692.txt, LUCENE-1692.txt, 
 LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt


 The analyzers in contrib need tests, preferably ones that test the behavior 
 of all the Token 'attributes' involved (offsets, type, etc) and not just what 
 they do with token text.
 This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: madvise(ptr, len, MADV_SEQUENTIAL)

2009-06-18 Thread Jason Rutherglen
Hmm... So the list at the bottom of this page looks accurate?
http://www.gnu.org/software/hello/manual/gnulib/posix_005ffadvise.html

If it is, then posix_fadvise works on Linux only?

Perhaps madvise would be better, then (judging by the much smaller unsupported
list)?  It seems to run on most platforms:
http://www.gnu.org/software/hello/manual/gnulib/madvise.html

On Wed, Jun 17, 2009 at 2:19 AM, Alan Bateman alan.bate...@sun.com wrote:

 Jason Rutherglen wrote:

 Alan,

 Do you think something like FileDescriptor.setAdvise (mirroring
 posix_fadvise) makes sense?

 -J

 Something like a posix_fadvise would be more appropriate for FileChannel or
 maybe as a usage hint when opening the file (the new APIs for opening files
 are extensible to allow for additional options in the future or even
 implementation specific options). I don't think we've had much interest in
 doing this, maybe because it would be a no-op on many operating systems.

 -Alan.



[jira] Updated: (LUCENE-1313) Near Realtime Search

2009-06-18 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1313:
-

Attachment: LUCENE-1313.patch

* TestThreadedOptimize passes. LogMergePolicy now filters the
segmentInfos based on the dir, rather than NRTMergePolicy
passing in only ramInfos or primaryInfos. Because LogMergePolicy is
careful to select contiguous segments, passing in only a subset of
segmentInfos broke the merge policy's selection.

* TestIndexWriter.testAddIndexOnDiskFull and
testAddIndexesWithCloseNoWait fail, which I don't think
happened before. testAddIndexOnDiskFull fails when
autoCommit=true, which I'm not sure will still be a valid test by the time
this patch goes in, but it probably needs to be looked into. 

The other previous notes are still valid.

 Near Realtime Search
 

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch


 Enable near realtime search in Lucene without external
 dependencies. When RAM NRT is enabled, the implementation adds a
 RAMDirectory to IndexWriter. Flushes go to the ramdir unless
 there is no available space. Merges are completed in the ram
 dir until there is no more available ram. 
 IW.optimize and IW.commit flush the ramdir to the primary
 directory, all other operations try to keep segments in ram
 until there is no more space.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-06-18 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721586#action_12721586
 ] 

Jason Rutherglen commented on LUCENE-1539:
--

I think it would be convenient to allow passing in the data files' absolute 
path, instead of assuming they're in a relative path.  

 Improve Benchmark
 -

 Key: LUCENE-1539
 URL: https://issues.apache.org/jira/browse/LUCENE-1539
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, 
 LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, 
 LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, 
 sortCollate2.py

   Original Estimate: 336h
  Remaining Estimate: 336h

 Benchmark can be improved by incorporating recent suggestions posted
 on java-dev. M. McCandless' Python scripts that execute multiple
 rounds of tests can either be incorporated into the codebase or
 converted to Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1466) CharFilter - normalize characters before tokenizer

2009-06-18 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-1466:
---

Attachment: LUCENE-1466.patch

updated patch attached.
- sync trunk (smart chinese analyzer(LUCENE-1629), etc.)
- added a useful idiom to get ChatStream and make private CharReader constructor
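
(If the idiom mirrors the Solr 1.4 classes this issue imports, usage presumably
looks something like the following - illustrative only, please check the attached
patch for the actual names and signatures:)

{code}
// a CharFilter can normalize characters before the Tokenizer ever sees them
NormalizeCharMap normMap = new NormalizeCharMap();
normMap.add("\u30fc", "-");   // map one character sequence to another

// wrap a plain Reader as a CharStream via the static helper,
// now that the CharReader constructor is private
CharStream cs = CharReader.get(new StringReader("some text"));
TokenStream ts = new WhitespaceTokenizer(new MappingCharFilter(normMap, cs));
{code}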

 CharFilter - normalize characters before tokenizer
 --

 Key: LUCENE-1466
 URL: https://issues.apache.org/jira/browse/LUCENE-1466
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Affects Versions: 2.4
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1466.patch, LUCENE-1466.patch, LUCENE-1466.patch


 This proposes to import CharFilter that has been introduced in Solr 1.4.
 Please see for the details:
 - SOLR-822
 - http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1466) CharFilter - normalize characters before tokenizer

2009-06-18 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721588#action_12721588
 ] 

Koji Sekiguchi edited comment on LUCENE-1466 at 6/18/09 7:04 PM:
-

updated patch attached.
- sync trunk (smart chinese analyzer(LUCENE-1629), etc.)
- added a useful idiom to get CharStream and make private CharReader constructor

  was (Author: koji):
updated patch attached.
- sync trunk (smart chinese analyzer(LUCENE-1629), etc.)
- added a useful idiom to get ChatStream and make private CharReader constructor
  
 CharFilter - normalize characters before tokenizer
 --

 Key: LUCENE-1466
 URL: https://issues.apache.org/jira/browse/LUCENE-1466
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Affects Versions: 2.4
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1466.patch, LUCENE-1466.patch, LUCENE-1466.patch


 This proposes to import CharFilter that has been introduced in Solr 1.4.
 Please see for the details:
 - SOLR-822
 - http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1692) Contrib analyzers need tests

2009-06-18 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1692:


Attachment: LUCENE-1692.txt

patch with the two one-line fixes:
1. fix offsets for the thai analyzer so highlighting, etc. will work.
2. use the stopwords list by default for smartchineseanalyzer so punctuation isn't 
indexed in a strange way.

i updated the testcases to reflect these.
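
for reference, the offset check in the updated thai testcase is along these lines
(paraphrased, not copied verbatim from the patch):

{code}
String text = "some mixed Thai and English input";
TokenStream ts = new ThaiAnalyzer().tokenStream("dummy", new StringReader(text));
int lastEnd = 0;
Token token = ts.next(new Token());
while (token != null) {
  // offsets must point back into the original text, and move forward,
  // otherwise highlighting produces garbage
  assertTrue(token.startOffset() >= lastEnd);
  assertTrue(token.endOffset() <= text.length());
  lastEnd = token.endOffset();
  token = ts.next(new Token());
}
{code}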




 Contrib analyzers need tests
 

 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: example.jpg, LUCENE-1692.txt, LUCENE-1692.txt, 
 LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt, LUCENE-1692.txt, 
 LUCENE-1692.txt


 The analyzers in contrib need tests, preferably ones that test the behavior 
 of all the Token 'attributes' involved (offsets, type, etc) and not just what 
 they do with token text.
 This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org