Re: Lucene 2.9 and deprecated IR.open() methods

2009-10-02 Thread Earwin Burrfoot
On Sat, Oct 3, 2009 at 03:29, Uwe Schindler u...@thetaphi.de wrote:
 It is also probably a good idea to move various settings methods from
 IW to that builder and have IW immutable with regard to configuration.
 I'm speaking of the likes of setWriteLockTimeout, setRAMBufferSizeMB,
 setMergePolicy, setMergeScheduler, setSimilarity.

 IndexWriter.Builder iwb = IndexWriter.builder().
   writeLockTimeout(0).
   RAMBufferSize(config.indexationBufferMB).
   maxBufferedDocs(...).
   similarity(...).
   analyzer(...);

 ... = iwb.build(dir1);
 ... = iwb.build(dir2);

 A happy user of google-collections API :-) These builders are really cool!

I feel caught in the act.

There are still a couple of things bothering me.
1. By introducing a builder, we get a whole heap of deprecated
constructors that will hang there for eternity. And then users will
scream in frustration: This class has 14(!) constructors and all of
them are deprecated! How on earth am I supposed to create this thing?
2. If someone creates IW with some reflectish, javabeanish tools, he's
busted. Not that I feel much compassion for such a person.

 I like Earwin's version more. A builder is very flexible, because you can
 chain all your properties (like StringBuilder works with its append method
 returning itself) and create the instance at the end.
Besides (arguably) cleaner syntax, the lack of which is (arguably) a
curse of many Java libraries, it also allows us to return a different
concrete implementation of IW without breaking back-compat, and to
choose that concrete implementation based on the settings provided,
if we ever feel like doing so.
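A minimal sketch of how create() could pick the implementation; the
nearRealTime setting and both subclass names are hypothetical:

  public IndexWriter create(Directory dir) throws IOException {
    // Choose a concrete subclass from the accumulated settings.
    if (nearRealTime) {
      return new NearRealTimeIndexWriter(dir, this);
    }
    return new DefaultIndexWriter(dir, this);
  }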

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




Re: Lucene 2.9 and deprecated IR.open() methods

2009-10-02 Thread Earwin Burrfoot
 Though what about required settings?  EG IW's builder must have
 Directory, Analyzer.  Would we pass these as up-front args to the
 initial builder?
I'd try to keep required settings to a minimum. The only one absolutely
required, imho, is a Directory, and it's best to specify it in the
create() method, so you can set all your IW parameters and then
build several instances, for different Directories for example.

If we decide to add more required settings, we're back to square one
- after a couple of years we're looking at 14 builder() overloads.
Okay, there is a way: take a look at how Guice handles binding
declarations in Modules - different builder methods may return
different interfaces implemented by 'this'.

class IndexWriter {
  public static NoAnalyzerYetBuilder builder() { return new HiddenTrueBuilder(); }

  interface NoAnalyzerYetBuilder {
    NoAnalyzerYetBuilder setRAMBuffer(...);
    NoAnalyzerYetBuilder setUseCompound(...);
    ...

    Builder setAnalyzer(Analyzer analyzer);
  }

  interface Builder extends NoAnalyzerYetBuilder {
    Builder setRAMBuffer(...);
    Builder setUseCompound(...);
    ...

    IndexWriter create(Directory dir);
  }

  private static class HiddenTrueBuilder implements Builder {
    ...
  }

  ...
}

This approach looks nice from the client side, but is a mess to implement.
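From the call site it would read like this (a sketch using the interfaces
above; StandardAnalyzer stands in for any Analyzer, dir for an open Directory):

  // create(Directory) only exists on Builder, and the only way to get a
  // Builder is to call setAnalyzer(...), so the compiler enforces the
  // required setting.
  IndexWriter writer = IndexWriter.builder()      // NoAnalyzerYetBuilder
      .setRAMBuffer(128)                          // still NoAnalyzerYetBuilder
      .setAnalyzer(new StandardAnalyzer())        // narrows to Builder
      .create(dir);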


 And shouldn't we still specify the version up-front so we can improve
 defaults over time without breaking back-compat?  (Else, how can
 we change defaults?)

 EG:

  IndexWriter.builder(Version.29, dir, analyzer)
    .setRAMBufferSizeMB(128)
    .setUseCompoundFile(false)
    ...
    .create()

 ?

It's probably okay to specify the version up front. But nothing bad
happens if we do it like this either:
IndexWriter.builder().
  defaultsFor(Version.29).
  setRam...

 Mike




Re: Lucene 2.9 and deprecated IR.open() methods

2009-10-02 Thread Earwin Burrfoot
 Call me old fashioned, but I like how the non-constructor params are set
 now.
And what happens when you index some docs, change these params, index
more docs, change params again, and commit? Let's throw in some threads?
You either end up writing really hairy state-control code, or you just
leave it broken, with a "Don't change parameters after you start pumping
docs through it!" plea covering your back somewhere in the JavaDocs.
If nothing else, having stuff final keeps the JIT really happy.
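A minimal sketch of the hazard; indexSomeDocs and FrozenConfig are
hypothetical, while the setters are the existing mutable ones:

  // Thread A keeps indexing while thread B reconfigures the same writer;
  // nothing defines which settings an in-flight document is indexed under.
  writer.setRAMBufferSizeMB(64.0);
  indexSomeDocs(writer);                   // thread A, still running...
  writer.setSimilarity(customSimilarity);  // ...thread B changes config
  writer.commit();

  // With configuration frozen at construction there is nothing to guard,
  // and the final fields are also friendlier to the JIT:
  class FrozenConfig {
    final double ramBufferSizeMB;
    final Similarity similarity;
    FrozenConfig(double ramBufferSizeMB, Similarity similarity) {
      this.ramBufferSizeMB = ramBufferSizeMB;
      this.similarity = similarity;
    }
  }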

 And for some reason I like a config object over a builder pattern for
 the required constructor params.
The builder pattern allows you to switch concrete implementations as you
please, taking parameters into account or not.
Besides that, there's no real difference. I prefer a builder, but that's just me :)

 That's just me though.

 Michael McCandless wrote:
 OK, I agree, using the builder approach looks compelling!




Re: Optimization and Corruption Issues

2009-10-01 Thread Earwin Burrfoot
 2.0 is pre Mike's fabulous indexing updates - which, just for one, means
 one thread doing the merging rather than multiple. I'm sure overall it's
 much slower.
If you're doing a full optimize, you're still using a single thread. Am I wrong?





Re: Optimization and Corruption Issues

2009-10-01 Thread Earwin Burrfoot
 If you're doing a full optimize, you're still using a single thread. Am I 
 wrong?

 Depends on how many merges are required, and the merge scheduler.  In
 this case (w/ 7000 segments, which is way too many, normally!),
 assuming ConcurrentMergeScheduler, multiple threads will be used since
 many merges will be pending.

 When it gets down to the last (enormous) merge, it's only one thread.
I'm speaking about full optimize. Is there any way to do it more
efficiently than running a single, last (enormous) merge?
If you try to parallelize, you're merging some documents several times
(more work) and killing your disks, as merges are mostly IO-bound.





Re: Query Parsing was Fwd: Lab - Esqueranto

2009-09-25 Thread Earwin Burrfoot
We use ANTLR, though without its tree API - it's a bit of an overkill.
It directly builds a query in our intermediate format, which is
traversed for synonym/phrase detection and converted to a Lucene query.

The library/language itself is pretty easy to learn, flexible, and has
a nice IDE.

On Fri, Sep 25, 2009 at 19:17, Peter Keegan peterlkee...@gmail.com wrote:
 We're using Antlr for our query parsing. What I like about it:
 - flexibility of separate lexer/parser and tree api
 - excellent IDE for building/testing the grammar
 However, the learning curve was quite long for me, although this was my
 first real encounter with parsers.

 Peter

 On Fri, Sep 25, 2009 at 9:58 AM, Grant Ingersoll gsing...@apache.org
 wrote:

 Has anyone looked/used Antlr for Query Parser capabilities?  There was
 some discussion over at Apache Labs that might bear discussing in light of
 our new Query Parser contrib.

 Begin forwarded message:

 From: Tim Williams william...@gmail.com
 Date: August 17, 2009 8:09:04 PM EDT
 To: l...@labs.apache.org
 Subject: Re: Lab - Esqueranto
 Reply-To: l...@labs.apache.org

 On Mon, Aug 17, 2009 at 7:00 PM, Grant Ingersoll gsing...@apache.org
 wrote:

 On Aug 2, 2009, at 1:43 PM, Tim Williams wrote:

 Hi Martin,

 Sure, if it works like I envision it, Lucene would just be *one*
 concrete tree grammar implementation - there could be others (i.e.
 OracleText); I'm thinking it is broader than one implementation -
 otherwise, I reckon it's Yet Another Lucene Query Parser (YALQP).
 For more practical reasons, I'm not a Lucene committer and it'd be
 slow going to play around with this through JIRA patches to their
 sandbox.

 FWIW, Lucene has recently added a new, more flexible Query Parser that
 allows for separation of the various pieces (syntax, intermediate
 representation, Lucene Query). You might want to check it out and see how
 that fits.

 Thanks Grant, yeah I've looked at that and it seems really (overly?)
 complex for what I'm trying to achieve.  It seems to re-implement much
 of the goodness that antlr provides for free.  For example, with antlr
 I already get a lexer/parser grammar separate from the tree grammar.
 So, to plug in a new parser syntax is trivial - just implement a new
 lexer/parser grammar that provides tree rewrites consistent with a
 lucene tree grammar.  Conversely, to implement a new concrete
 implementation, just implement a new tree grammar for the existing
 lexer/parser grammar.

 Of course, maybe I'll get down this road and realize how naive my path
 is and just switch over.  For now, just looking at a query parser
 that, by itself, is approaching the size of the Lucene core code base
 is intimidating :)  Thanks for the pointer though; I'm subscribed over
 there and will keep an eye out for progress on the new parser.

 Thanks,
 --tim












Re: How to leverage the LogMergePolicy calibrateSizeByDeletes patch in Solr ?

2009-09-22 Thread Earwin Burrfoot
On Tue, Sep 22, 2009 at 19:08, Yonik Seeley yo...@lucidimagination.com wrote:
 On Tue, Sep 22, 2009 at 10:48 AM, Michael McCandless
 luc...@mikemccandless.com wrote:
 John are you using IndexWriter.setMergedSegmentWarmer, so that a newly
 merged segment is warmed before it's put into production (returned
 by getReader)?

 I'm still not sure I see the reason for complicating the IndexWriter
 with warming... can't this be done just as efficiently (if not more
 efficiently) in user/application space?
+1





Re: who clears attributes?

2009-08-11 Thread Earwin Burrfoot
On Tue, Aug 11, 2009 at 15:09, Yonik Seeley yo...@lucidimagination.com wrote:
 On Tue, Aug 11, 2009 at 6:50 AM, Robert Muir rcm...@gmail.com wrote:
 On Tue, Aug 11, 2009 at 4:28 AM, Michael Busch busch...@gmail.com wrote:
 There was a performance test in Solr that apparently ran much slower
 after upgrading to the new Lucene jar. This test is testing a rather
 uncommon scenario: very very short documents.

 Actually, it's more uncommon than that: it's very very short documents,
 without implementing reusableTokenStream().
 This makes it basically a benchmark of ctor cost... it doesn't really
 benchmark the token API, in my opinion.

 You would be surprised... there are quite a few Solr users that have
 relatively short documents... or even if they are sizeable documents,
 they have up to hundreds of short metadata-type fields (generally a
 token or two).

 Reusing TokenStreams has become a must in Solr IMO since construction
 costs (hashmap lookups, etc) and GC costs (larger objects) have been
 growing.  I'm focused on that now...

 Robert's taking a crack at fixing things up so users can actually
 create reusable analyzers out of our filters:
 https://issues.apache.org/jira/browse/LUCENE-1794

+1. We don't use Solr, but have quite a bunch of medium and
short-sized documents. Plus heaps of metadata fields.

I'm yet to read Uwe's example, but I feel I'm a bit misunderstood by
some of you. My gripe with the new API is not that it brings us troubles
(which get solved one way or another); it is that the switch and the
associated migration costs bring zero benefits in the immediate and
remote future.
The only person who has tried to disprove this claim is Uwe. Others
either say the problems are solved, so it's okay to move to the new
API, or this will be usable when flexindexing arrives. Sorry, the
last phrase doesn't hold its place: this API is orthogonal to
flexindexing, or at least nobody has shown the opposite.
So, what I'm arguing against is adding some code (and forcing users to
migrate) just because we can, with no other reason.




Re: who clears attributes?

2009-08-11 Thread Earwin Burrfoot
 The only person that tried to disprove this claim is Uwe. Others
 either say the problems are solved, so it's okay to move to the new
 API, or this will be usable when flexindexing arrives.

 Others (not me) have spent a lot of time going over this before (more than
 once, I think) - they probably are just sick of retyping. Lots of searchable
 archives out there, though.

Okay, I'll dig into them. Sorry for being a bother.




[jira] Commented: (LUCENE-1799) Unicode compression

2009-08-11 Thread Earwin Burrfoot (JIRA)


Earwin Burrfoot commented on LUCENE-1799:
-

I think right now this can be implemented as a delegating Directory.
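A minimal sketch of such a delegating Directory against the 2.9-era store
API (the method list is abridged and version-dependent; wrapInput/wrapOutput
are hypothetical hooks where the SCSU/BOCU-1 codec would go):

  import java.io.IOException;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.IndexInput;
  import org.apache.lucene.store.IndexOutput;

  public class CompressingDirectory extends Directory {
    private final Directory delegate;

    public CompressingDirectory(Directory delegate) { this.delegate = delegate; }

    // Intercept the streams; a real version wraps them with the codec.
    public IndexOutput createOutput(String name) throws IOException {
      return wrapOutput(delegate.createOutput(name));  // compress on write
    }
    public IndexInput openInput(String name) throws IOException {
      return wrapInput(delegate.openInput(name));      // decompress on read
    }

    // Pure delegation for the remaining abstract methods.
    public String[] list() throws IOException { return delegate.list(); }
    public boolean fileExists(String name) throws IOException { return delegate.fileExists(name); }
    public long fileModified(String name) throws IOException { return delegate.fileModified(name); }
    public void touchFile(String name) throws IOException { delegate.touchFile(name); }
    public void deleteFile(String name) throws IOException { delegate.deleteFile(name); }
    public long fileLength(String name) throws IOException { return delegate.fileLength(name); }
    public void close() throws IOException { delegate.close(); }

    private IndexOutput wrapOutput(IndexOutput out) { return out; /* codec goes here */ }
    private IndexInput wrapInput(IndexInput in) { return in; /* codec goes here */ }
  }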

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor

 In LUCENE-1793, there is the off-topic suggestion to provide compression of
 Unicode data. The motivation was a custom encoding in a Russian analyzer. The
 original supposition was that it provided a more compact index.
 This led to the comment that a different or compressed encoding would be a
 generally useful feature.
 BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM
 with an implementation in ICU. If Lucene provided its own implementation, a
 freely available, royalty-free license would need to be obtained.
 SCSU is another Unicode compression algorithm that could be used.
 An advantage of these methods is that they work on the whole of Unicode. If
 that is not needed, an encoding such as ISO-8859-1 (or whatever covers the
 input) could be used.




Re: indexing_slowdown_with_latest_lucene_udpate

2009-08-10 Thread Earwin Burrfoot
Or, we can just throw that detection out of the window, for a less
smooth back-compat experience, less hacky code, and no slowdown.

On Mon, Aug 10, 2009 at 19:02, Uwe Schindleru...@thetaphi.de wrote:
 The question is if that would get better if the reflection calls are only
 done one time per class, using an IdentityHashMap<Class,Boolean>. The other
 reflection code in AttributeSource uses a static cache for such type of
 things (e.g. the Attribute -> AttributeImpl mappings in AttributeSource.
 DefaultAttributeFactory.getClassForInterface()).

 I could do some tests about that and supply a patch. I was thinking about
 that but threw it away (as it needs some synchronization on the cache Map,
 which may also outweigh the benefit).
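A sketch of the cache being described; isNewAPIClass and computeByReflection
are hypothetical stand-ins for the actual reflection check:

  private static final Map<Class<?>, Boolean> classCache =
      Collections.synchronizedMap(new IdentityHashMap<Class<?>, Boolean>());

  static boolean isNewAPIClass(Class<?> clazz) {
    Boolean cached = classCache.get(clazz);
    if (cached == null) {
      cached = Boolean.valueOf(computeByReflection(clazz));  // expensive, once per class
      classCache.put(clazz, cached);
    }
    return cached.booleanValue();
  }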

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de

 -Original Message-
 From: Mark Miller [mailto:markrmil...@gmail.com]
 Sent: Monday, August 10, 2009 4:48 PM
 To: java-dev@lucene.apache.org
 Subject: Re: indexing_slowdown_with_latest_lucene_udpate

 Robert Muir wrote:
  This is real and not just for very short docs.
 Yes, you still pay the cost for longer docs, but it just becomes less
 important the longer the docs, as it plays a smaller role. Load a ton of
 one-term docs, and it might be 50-60% slower - add a bunch of articles,
 and it might be closer to 15-20% (I don't know the numbers, but the
 longer I made the docs, the less % slowdown, obviously). Still a good hit,
 but a short-doc test magnifies the problem.

 It affects things no matter what, but when you don't do much tokenizing,
 normalizing, the cost of the reflection/tokenstream init dominates.

 - Mark











[jira] Commented: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers

2009-08-10 Thread Earwin Burrfoot (JIRA)


Earwin Burrfoot commented on LUCENE-1793:
-

bq. I am guessing the rationale for the current code is to try to reduce index 
size? (since these languages are double-byte encoded in Unicode). 
The rationale was most probably to support existing non-Unicode
systems/databases/files, whatever. My say is: anyone still holding onto KOI8,
CP1251 and friends should silently do harakiri.

 remove custom encoding support in Greek/Russian Analyzers
 -

 Key: LUCENE-1793
 URL: https://issues.apache.org/jira/browse/LUCENE-1793
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Reporter: Robert Muir
Priority: Minor
 Attachments: LUCENE-1793.patch


 The Greek and Russian analyzers support custom encodings such as KOI-8; they
 define things like lowercasing and tokenization for these.
 I think that analyzers should support Unicode, and that conversion/handling of
 other charsets belongs somewhere else.
 I would like to deprecate/remove the support for these other encodings.




Re: who clears attributes?

2009-08-10 Thread Earwin Burrfoot
I'll deviate from the topic somewhat.
What exact benefits does the new TokenStream API yield? Are we sure
we want it released with 2.9?
By now I have only seen various elaborate problems, but haven't seen a
single piece of code becoming simpler.

On Mon, Aug 10, 2009 at 21:50, Uwe Schindleru...@thetaphi.de wrote:
 Yes. Is there a way to enforce this for all Tokenizers automatically? As
 incrementToken() will be abstract in 3.0, there cannot be a default impl. So
 all Tokenizers should call clearAttributes() as the first call in
 incrementToken().

 Then we still have the problem of the slow iterator creation (which was
 sped up a little bit by removing the unmodifiable wrapper). This can be
 solved by using an additional ArrayList in AttributeSource that gets all
 AttributeImpl instances, but this would bring an additional initialization
 cost on creating the Tokenizer chain.

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: Monday, August 10, 2009 7:42 PM
 To: java-dev@lucene.apache.org
 Subject: Re: who clears attributes?

 Thinking through this a little more, I don't see an alternative to the
 tokenizer clearing all attributes at the start of incrementToken().

 Consider a DefaultPayloadTokenFilter that only sets a payload if one
 isn't already set - it's clear that this filter can't clear the
 payload attribute, so it must be cleared by the head of the chain -
 the tokenizer.  Right?

 -Yonik
 http://www.lucidimagination.com









Re: who clears attributes?

2009-08-10 Thread Earwin Burrfoot
On Mon, Aug 10, 2009 at 22:50, Grant Ingersoll gsing...@apache.org wrote:

 On Aug 10, 2009, at 2:00 PM, Earwin Burrfoot wrote:

 I'll deviate from the topic somewhat.
 What are exact benefits that new tokenstream API yields? Are we sure
 we want it released with 2.9?
 By now I only see various elaborate problems, but haven't seen a
 single piece of code becoming simpler.

 In theory, it sets up for more indexing/searching possibilities in 3.0, but
 in the meantime, it is proving to be quite problematic due to back
 compatibility restrictions.
I'm not quite sure which exact indexing/searching possibilities the
new API opens for us.
Some new ways of handling text? Okay, I'd like each token to have one
more number in addition to posIncr, so I can have my 'true multiword
synonyms'. Maybe, just maybe, there will be a pair of other
extensions. Use cases here are really scarce. Plus, if they're
successful/useful, they will most probably be included out of the box,
so we don't need much flexibility here.
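For what it's worth, under the new API that one more number would be just
another attribute; a hypothetical sketch against the 2.9 Attribute interface:

  import org.apache.lucene.util.Attribute;

  // How many positions a token spans, alongside the usual posIncr,
  // to express true multiword synonyms.
  public interface PositionLengthAttribute extends Attribute {
    void setPositionLength(int positionLength);
    int getPositionLength();
  }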
Something other than text? Numbers, with good range queries. Dates.
Spatial data. Your-type-here. For these, a flexible, stream-oriented
text-processing API is totally useless.

 I have serious doubts about releasing this new API until these performance
 issues are resolved and better proven out from a usability standpoint.
 It simply is too much to swallow for most users, as
 Analyzers/TokenStreams/etc. are easily the most common place for people to
 inject their own capabilities and there is no way we should be
 taking a 30% hit in performance for some theoretical speed up and new search
 capability 1 year from now.
I have a feeling that the best idea, before more damage is done, is to
roll back this new API, store the patch, and try rolling it out once
again when we have use cases and more code to justify it.




Re: 2.5 versus 2.9, was Re: who clears attributes?

2009-08-10 Thread Earwin Burrfoot
On Tue, Aug 11, 2009 at 00:37, Michael Busch busch...@gmail.com wrote:
 On 8/10/09 1:30 PM, Grant Ingersoll wrote:


 I think your 2.5 proposal has drawbacks: if we release 2.5 now to test
 the new major features in the field, then do you want to stop adding new
 features to trunk until we release 2.9, so as not to have the same
 situation then again? How long should this testing in the field take?

 I don't know.  How long does any release cycle last in Lucene?


 But we'll always have the same problem, no? We need to find a solution that
 allows us to keep adding features; dedicated deprecation releases are not
 good.
Parallel branches. The only way of simultaneously satisfying several
conflicting needs in software development.




Re: who clears attributes?

2009-08-10 Thread Earwin Burrfoot
On Tue, Aug 11, 2009 at 00:54, Uwe Schindler u...@thetaphi.de wrote:
  I have serious doubts about releasing this new API until these
  performance issues are resolved and better proven out from a
  usability
  standpoint.
 
  I think LUCENE-1796 has fixed the performance problems, which was
  caused by
  a missing reflection-cache needed for bw compatibility. I hope to
  commit
  soon!
 
  2.9 may be a little bit slower when you mix old and new API and do
  not reuse
  Tokenizers (but Robert is already adding reusableTokenStream to all
  contrib
  analyzers). When the backwards layer is removed completely or
  setOnlyUseNewAPI is enabled, there is no speed impact at all.
 


 The Analysis features of Lucene are the single most common place where
 people enhance Lucene.  Very few add queries, or muck with field
 caches, but they do write their own Analyzers and TokenStreams,
 etc.    Within that, mixing old and new is likely the most common case
 for everyone who has made their own customizations, so a little bit
 slower is something I'd rather not live with just for the sake of
 some supposed goodness in a year or two.

 But because of this flexibility, we added the backwards layer. The old style
 with setUseNewAPI was not flexible at all, and nobody would move his
 Tokenizers to the new API without that flexibility (maybe he uses external
 analyzer packages not yet updated).

 With "a little bit" I mean the cost of wrapping the old and new API is
 really minimal; it is just an if statement and a method call, hopefully
 optimized away by the JVM. In my tests the standard deviation between
 different test runs was much higher than the difference between mixing
 old/new API (on Win32), so it is not really certain that the cost comes from
 the delegation.

 The only case that is really slower is the (now minimized) cost of creation
 in TokenStream.init when you do not reuse TokenStreams: two LinkedHashMaps
 have to be created and set up. But this is not caused by the backwards layer.

 Uwe


Uwe, the problems I raised are still here - what is the benefit of
moving to this API right now? I see none. What is the future benefit
of moving to this API? It is very vague. Someone said this API is
generic, but there are different kinds of genericity. Are we sure we
abstracted the right thing? How will it be used? Where are the examples?

Right now it is an exercise in programming, which forces us into newer
and newer exercises. Very exciting, very rewarding, but as of now -
pointless.




Re: pieces missing in reusable analyzers?

2009-08-10 Thread Earwin Burrfoot
 I had thought that implementing reusable analyzers in solr was going
 to be cake... but either I'm missing something, or Lucene is missing
 something.

 Here's the way that one used to create custom analyzers:

 class CustomAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(new NGramTokenFilter(new
 StandardTokenizer(reader)));
  }
 }


 Now let's try to make this reusable:

 class CustomAnalyzer2 extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(new NGramTokenFilter(new
 StandardTokenizer(reader)));
  }

  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader
 reader) throws IOException {
     TokenStream ts = (TokenStream) getPreviousTokenStream();
    if (ts == null) {
      ts = tokenStream(fieldName, reader);
      setPreviousTokenStream(ts);
      return ts;
    } else {
      // uh... how do I reset a token stream?
      return ts;
    }
  }
 }


 See the missing piece?  Seems like TokenStream needs a reset(Reader r)
 method or something?

I'm just keeping a reference to the Tokenizer, so I can reset it with a
new reader. Though this situation is awkward, a TS definitely does not
need a reset(Reader).
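A minimal sketch of that approach (SavedStreams is just a hypothetical
holder class), with the caveat that NGramTokenFilter itself currently has
no way to reset its internal state:

  class CustomAnalyzer3 extends Analyzer {
    private static class SavedStreams {
      Tokenizer source;     // head of the chain, kept so we can re-point it
      TokenStream result;   // tail of the chain, what consumers iterate
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
      return new LowerCaseFilter(new NGramTokenFilter(new StandardTokenizer(reader)));
    }

    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader)
        throws IOException {
      SavedStreams streams = (SavedStreams) getPreviousTokenStream();
      if (streams == null) {
        streams = new SavedStreams();
        streams.source = new StandardTokenizer(reader);
        streams.result = new LowerCaseFilter(new NGramTokenFilter(streams.source));
        setPreviousTokenStream(streams);
      } else {
        streams.source.reset(reader);  // point the Tokenizer at the new input
        streams.result.reset();        // let the filters clear their state
      }
      return streams.result;
    }
  }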





Re: who clears attributes?

2009-08-10 Thread Earwin Burrfoot
 Well, I have real use cases for it, but all of it is still missing the
 biggest piece:  search side support.  It's the 900 lb. elephant in the room.
   The 500 lb. elephant is the fact that all these attributes, AIUI, require
 you to hook in your own indexing chain, etc. in order to even be indexed,
 which is all package private stuff.   It's not even clear to me what happens
 right now if you were to, say have a Token Stream that, say, had only one
 Attribute on it and none of the existing attributes (term buffer, length,
 position, etc.)  Please correct me if I am wrong, I still don't have a deep
 understanding of it all.
Even pseudocode would be good. Custom indexing chain for abstract
attributes sounds like one of those microsoft.com definitions - serious,
determined, but vague.
If you take the current Token and start throwing away some of its fields,
the resulting index contents are obvious for some combinations and
absurd for others. You don't need this new API to handle the obvious ones.

 Oh, and now it seems the new QP is dependent on it all.
That's why I said earlier before more damage is done.

 Michael has always been up front that this new API is in preparation for 
 flexible indexing. It doesn't give us the goodness - he has laid out the 
 reasons for moving before the goodness comes more than once I think.
My problem is not waiting for 'goodness'. It is that I don't currently
see what goodness will come from this API even in the remote future.
That's why I am asking! :)

 Flexible indexing will lead to all kinds of little cool things - the likes of 
 which have been discussed a lot in older emails. It will likely lead to 
 things we cannot predict as well.
 Everything will be more flexible. It also could play a part in CSF, and work 
  on allowing custom files to plug into merging. Plus everything else that's 
  been mentioned (pfor, etc.). I've been sold on the long-term benefits. I 
 don't think you need these API for them, but its my understanding it helps 
 solve part of the equation.
Yeah. I, too, would like to see all these little cool things, and I
don't think we need this API for them.
Flexible indexing is going to handle various different datatypes
besides text, so I can only reiterate - it cannot rely on generic
stream-based text-handling API for consuming data.

 A bunch of issues have come up. To my knowledge, they have been addressed 
 with vigor every time. If someone is unhappy with how something has been 
 addressed, and it
 needs to be addressed further, please speak up. Otherwise, I don't think the 
 sky is falling - I think the new API is being shaken out.
An API is born dead without use cases. If a year later we get closer to
the flexindexing it is supposed to support, and then we understand we
missed some crucial thing - WHAM! - our back-compat policy kicks in and
makes our lives miserable once more.




Re: pieces missing in reusable analyzers?

2009-08-10 Thread Earwin Burrfoot
 I'm just keeping a reference to Tokenizer, so I can reset it with a
 new reader. Though this situation is awkward, TS definetly does not
 need a reset(Reader).

 Then how do you notify the other filters that they should reset their state?
 TokenStream.reset()?  The javadoc specifies that it's actually used
 for something else - but perhaps it can be reused for this purpose?
Yes, exactly. The TokenFilter override of reset() chains the call to the
input stream.

 I specifically used NGramTokenFilter in my example because it did use
 internal state (and it's a bug that it has no way to reset that state
 currently).
My filters are all my own, so they reset and chain properly.
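For example, a filter with internal state resets itself and chains in one
override (MyStatefulFilter and its counter are hypothetical):

  class MyStatefulFilter extends TokenFilter {
    private int tokensSeen;  // per-stream state

    protected MyStatefulFilter(TokenStream input) {
      super(input);
    }

    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) return false;
      tokensSeen++;  // e.g. count tokens seen on this stream
      return true;
    }

    public void reset() throws IOException {
      super.reset();   // TokenFilter.reset() chains to input.reset()
      tokensSeen = 0;  // clear this filter's own state
    }
  }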





Re: ConcurrentMergeScheduler and MergePolicy question

2009-08-09 Thread Earwin Burrfoot
On Sun, Aug 9, 2009 at 08:38, Jason
Rutherglen jason.rutherg...@gmail.com wrote:
 You don't have to copy. You can have one machine optimize your indexes
 whilst the other serves user requests; then they switch roles, rinse,
 repeat. This approach also works with sharding, and with more than 2-way
 mirroring.

 What does the un-optimized server do after the other server is
 optimized? The search requests go to the newly optimized server,
 however if we're mirroring, the 2nd server now needs the
 optimized index as well?

The second server now stops servicing requests and starts optimizing.
You can also keep them running together for some time, depending on
how you're serious about always running on optimized index.




Re: ConcurrentMergeScheduler and MergePolicy question

2009-08-08 Thread Earwin Burrfoot
 Perhaps the ideal search system architecture that requires
 optimizing is to dedicate a server to it, copy the index to the
 optimize server, do the optimize, copy the index off (to a
 search server) and start again for the next optimize task.

 I wonder how/if this would work with Hadoop/HDFS as copying
 100GB around would presumably tie up the network? Also, I've
 found rsyncing large optimized indexes to be time consuming and
 wreaks havoc on the searcher server's IO subsystem. Usually this
 is unacceptable for the user as the queries will suddenly
 degrade.
You don't have to copy. You can have one machine optimize your indexes
whilst the other serves user requests; then they switch roles, rinse,
repeat. This approach also works with sharding, and with more than 2-way
mirroring.




Re: Attributes, DocConsumer, Flexible Indexing, etc.

2009-08-06 Thread Earwin Burrfoot
I always thought flexible indexing is not only for storing your
app-specific data next to terms/docs.
It's something more along the lines of efficient geo search, or the
ability to try out various index encoding schemes without patching Lucene.

In other words, it is something that can be a basis for an easy/pluggable
implementation of payload-type functionality, not vice versa.

On Thu, Aug 6, 2009 at 01:55, Grant Ingersoll gsing...@apache.org wrote:

 On Aug 5, 2009, at 4:35 PM, Michael Busch wrote:

 On 8/5/09 1:07 PM, Grant Ingersoll wrote:

 Hmmm, OK.

 Random, somewhat uneducated thought:  Why not just define the codecs to
 create byte arrays?  Then we can use the existing payload capability much
 like I do with the DelimitedPayloadTokenFilter.   We'd probably have to make
 sure this still worked with Similarity, but it seems like it could.
  Thinking on this some more, seems like this could work already with an
 AttributePayloadEncoder or something like an AttributeToPayloadTokenFilter
 (I know, horrible name).  Then, on the Query side, the AttributeTermQuery is
 just a glorified BoostingTermQuery with some callback hooks for dealing with
 the Attribute (but maybe that isn't even needed), either that or we just
 provide helper methods to the Similarity class so that people can easily
 decode the byte array into an Attribute.  In fact, maybe all that needs to
 happen is the Attributes need to define encode/decode methods that
 (de)serialize a byte array.

 Seems like this approach would require very little in the way of changes
 to Lucene, but I admit it isn't fully baked in my mind just yet.  It also
 has the nice benefit that all the work we did on Payloads isn't wasted.

 This is resonating more and more with me.  What do you think?


 Well I think this would be a nice way of using the payloads better.

 However, the idea behind flexible indexing is that you can customize the
 on-disk encoding in a way that it is as efficient as it can be for your
 particular use case. E.g. for payloads we currently have to encode the
 length. An application might not have to do that if it knows exactly what is
 stored.
 Then there's only the Payload API that returns you a byte array. It
 basically copies the contents of the IndexInput (usually a
 BufferedIndexInput, which means array copy from the byte buffer to the
 payload byte array). If the application knows exactly what is stored it can
 read/decode it more efficiently.

 Yeah, but really are you saving that much?  4 bytes per token?  It's not
 like you are saving much in terms of seeks, since you are already there
 anyway.  The only downside I see is a slightly larger index.  Would be
 interesting to try it out and see.





 The latter inefficiency we could solve by improving the payloads API: it
 could return an IndexInput instead of the byte array and the caller could
 consume it more efficiently.

 This is also interesting, but again requires some changes.  With what I'm
 proposing, I think it could be done very simply w/o any API changes, and we
 just need to expose some of the IndexInput/Output helper classes a bit more
 to make it easier for people to encode/decode their stuff.  Then, just
 documentation and some more Boosting*Query (Peter has already done
 BoostingNearQuery) and I think you have a pretty good flexible indexing AND
 searching capability all in a back compatible way using our existing code.


 So I agree that we could use Attributes to make the payloads feature
 better usable, but I don't think it will be a replacement for flexible
 indexing.




 Michael









Re: IndexWriter.getReader usage

2009-08-03 Thread Earwin Burrfoot
 The biggest win for NRT was switching to per-segment Collector because
 that meant we could re-use FieldCache entries for all segments that
 hadn't changed.
In my opinion, this switch was enough to get as NRT-ey as you want.
Fusing IR/IW together makes Lucene a great deal more complicated and
just a milli-tad closer to RT.

 I'm curious as to how it obviates the need for a RAM dir?
 In my use case I use them to create indexes and perform searches.
 In the latter it avoids OS file indexing and virus scanner contention (40 min 
 reduced to less than 2 min).
Isn't indexing your indexes (omg), checking them for viruses, and
striving for performance ..err.. a little bit self-contradictory?




Re: Java caching of low-level index data?

2009-08-03 Thread Earwin Burrfoot
 I'm curious if anyone has thought about (or even tried) caching the low-level 
 index data in Java, rather than
 in the OS.  For example, at the IndexInput level there could be an LRU cache 
 of byte[] blocks, similar to
 how a RDBMS caches index pages.  (Conveniently, BufferedIndexInput already 
 reads in 1k chunks.) You
 would reverse the advice above and instead make your JVM heap as large as 
 possible (or at least large
 enough to achieve a desired speed/space tradeoff).
I did something along these lines. It sucks. Having big Java heaps
lands you with insane GC times. Loading GB-sized files into a bunch
of byte[1024] arrays also wastes memory. The best bet for now is to
rely on mmap and the OS file cache.
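For concreteness, a toy sketch of the kind of block cache being discussed
(hypothetical, not Lucene code) - thousands of such byte[1024] blocks on a
big heap are exactly what gives the GC grief:

  // LRU map from file@blockIndex to 1 KB blocks, using LinkedHashMap's
  // access order and its removeEldestEntry eviction hook.
  class BlockCache {
    private final LinkedHashMap<String, byte[]> blocks;

    BlockCache(final int maxBlocks) {
      this.blocks = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
        protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
          return size() > maxBlocks;  // evict the least recently used block
        }
      };
    }

    byte[] get(String file, long blockIndex) {
      return blocks.get(file + "@" + blockIndex);
    }

    void put(String file, long blockIndex, byte[] block) {
      blocks.put(file + "@" + blockIndex, block);
    }
  }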

 I think swappiness is exactly the configuration that tells Linux just
 how happily it should swapp out application memory for IO cache vs
 other IO cache for new IO cache.
swappiness is roughly the percentage of free memory after which the OS
starts searching for pages suitable for paging out. If set to a low
value, the OS wakes up only in near-OOM conditions. If set to a high
value, then as soon as the OS decides (according to some heuristics)
that a page is eligible for page-out, it goes to disk.





[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-18 Thread Earwin Burrfoot (JIRA)


Earwin Burrfoot commented on LUCENE-1748:
-

bq. We should drop PayloadSpans and just add getPayload to Spans. This should 
be a compile time break.
+1

 getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
 --

 Key: LUCENE-1748
 URL: https://issues.apache.org/jira/browse/LUCENE-1748
 Project: Lucene - Java
  Issue Type: Bug
  Components: Query/Scoring
Affects Versions: 2.4, 2.4.1
 Environment: all
Reporter: Hugh Cayless
 Fix For: 2.9, 3.0, 3.1


 I just spent a long time tracking down a bug resulting from upgrading to 
 Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was 
 written against 2.3.  Since the project's SpanQuerys didn't implement 
 getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans 
 which returned null and caused a NullPointerException in the Lucene code, far 
 away from the actual source of the problem.  
 It would be much better for this kind of thing to show up at compile time, I 
 think.
 Thanks!




[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-16 Thread Earwin Burrfoot (JIRA)


Earwin Burrfoot commented on LUCENE-1748:
-

bq. Shouldn't it throw a runtime exception (unsupported operation?) or something?
What is the difference between adding an abstract method and adding a method
that throws an exception, with regard to drop-in jar back compat?
In both cases, when you drop your new jar in, you get an exception; in the
latter case the exception is merely deferred.





[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-16 Thread Earwin Burrfoot (JIRA)


Earwin Burrfoot commented on LUCENE-1748:
-

I took a glance at the code; the whole getPayloadSpans deal is a heresy.

Each and every implementation looks like:
  public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
    return (PayloadSpans) getSpans(reader);
  }

Moving it to the base SpanQuery is equally as broken as the current solution,
but yields much less strange copy-paste.

I also have a faint feeling that if you expose a method like
  ClassA method();
you can then upgrade it to
  SubclassOfClassA method();
without breaking drop-in compatibility, which renders the getPayloadSpans vs
getSpans alternative totally useless.





[jira] Issue Comment Edited: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-16 Thread Earwin Burrfoot (JIRA)


Earwin Burrfoot edited comment on LUCENE-1748 at 7/16/09 7:54 AM:
--

I took a glance at the code; the whole getPayloadSpans deal is a heresy.

Each and every implementation looks like:
  public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
    return (PayloadSpans) getSpans(reader);
  }

Moving it to the base SpanQuery is equally as broken as the current solution,
but yields much less strange copy-paste.

-I also have a faint feeling that if you expose a method like-
-ClassA method();-
-you can then upgrade it to-
-SubclassOfClassA method();-
-without breaking drop-in compatibility, which renders the getPayloadSpans vs
getSpans alternative totally useless-
Ok, I'm wrong.




[jira] Commented: (LUCENE-1743) MMapDirectory should only mmap large files, small files should be opened using SimpleFS/NIOFS

2009-07-15 Thread Earwin Burrfoot (JIRA)


Earwin Burrfoot commented on LUCENE-1743:
-

The initial motive for the issue seems wrong to me.

bq. For most operating systems, mapping a file into memory is more expensive
than reading or writing a few tens of kilobytes of data via the usual read and
write methods. From the standpoint of performance it is generally only worth
mapping relatively large files into memory.
This is probably right if you're doing a single read through the file. If you're
opening/mapping it and doing thousands of repeated reads, mmap would be superior,
because after the initial mapping it's just a memory access vs. a system call for
file.read().

 MMapDirectory should only mmap large files, small files should be opened 
 using SimpleFS/NIOFS
 -

 Key: LUCENE-1743
 URL: https://issues.apache.org/jira/browse/LUCENE-1743
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1


 This is a followup to LUCENE-1741:
 Javadocs state (in FileChannel#map): For most operating systems, mapping a 
 file into memory is more expensive than reading or writing a few tens of 
 kilobytes of data via the usual read and write methods. From the standpoint 
 of performance it is generally only worth mapping relatively large files into 
 memory.
 MMapDirectory should get a user-configurable size parameter that is a lower 
 limit for mmapping files. All files with a size < limit should be opened using 
 a conventional IndexInput from SimpleFS or NIO (another configuration option 
 for the fallback?).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1743) MMapDirectory should only mmap large files, small files should be opened using SimpleFS/NIOFS

2009-07-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731632#action_12731632
 ] 

Earwin Burrfoot edited comment on LUCENE-1743 at 7/15/09 12:14 PM:
---

The initial motive for the issue seems wrong to me.

bq. For most operating systems, mapping a file into memory is more expensive 
than reading or writing a few tens of kilobytes of data via the usual read and 
write methods. From the standpoint of performance it is generally only worth 
mapping relatively large files into memory.
It is probably right if you're doing a single read through the file. If you're 
opening/mapping it and doing thousands of repeated reads, mmap would be superior, 
because after the initial mapping it's just a memory access vs. a system call for 
file.read().

Add: In case you're not doing repeated reads, and just read these small files 
once from time to time, you can totally neglect the speed difference between mmap 
and fopen. At least it doesn't warrant the increased complexity.

  was (Author: earwin):
The initial motive for the issue seems wrong to me.

bq. For most operating systems, mapping a file into memory is more expensive 
than reading or writing a few tens of kilobytes of data via the usual read and 
write methods. From the standpoint of performance it is generally only worth 
mapping relatively large files into memory.
It is probably right if you're doing a single read through the file. If you're 
opening/mapping it and doing thousands of repeated reads, mmap would be superior, 
because after the initial mapping it's just a memory access vs. a system call for 
file.read().
  
 MMapDirectory should only mmap large files, small files should be opened 
 using SimpleFS/NIOFS
 -

 Key: LUCENE-1743
 URL: https://issues.apache.org/jira/browse/LUCENE-1743
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1


 This is a followup to LUCENE-1741:
 Javadocs state (in FileChannel#map): For most operating systems, mapping a 
 file into memory is more expensive than reading or writing a few tens of 
 kilobytes of data via the usual read and write methods. From the standpoint 
 of performance it is generally only worth mapping relatively large files into 
 memory.
 MMapDirectory should get a user-configurable size parameter that is a lower 
 limit for mmapping files. All files with a size < limit should be opened using 
 a conventional IndexInput from SimpleFS or NIO (another configuration option 
 for the fallback?).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1743) MMapDirectory should only mmap large files, small files should be opened using SimpleFS/NIOFS

2009-07-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731639#action_12731639
 ] 

Earwin Burrfoot commented on LUCENE-1743:
-

bq. My problem was more with all these small files like segments_ and 
segments.gen or *.del files. They are small and only used one time.
I can only reiterate my point. These files aren't opened at a rate of 10k files per 
second, so your win is going to be on the order of microseconds per reopen - at 
the cost of increased complexity.

 MMapDirectory should only mmap large files, small files should be opened 
 using SimpleFS/NIOFS
 -

 Key: LUCENE-1743
 URL: https://issues.apache.org/jira/browse/LUCENE-1743
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1


 This is a followup to LUCENE-1741:
 Javadocs state (in FileChannel#map): For most operating systems, mapping a 
 file into memory is more expensive than reading or writing a few tens of 
 kilobytes of data via the usual read and write methods. From the standpoint 
 of performance it is generally only worth mapping relatively large files into 
 memory.
 MMapDirectory should get a user-configurable size parameter that is a lower 
 limit for mmapping files. All files with a size < limit should be opened using 
 a conventional IndexInput from SimpleFS or NIO (another configuration option 
 for the fallback?).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: A Comparison of Open Source Search Engines

2009-07-06 Thread Earwin Burrfoot
I'd say that out of these libraries only Lucene and Sphinx are worth mentioning.

There's also MG4J, which wasn't covered and has a nice algorithmic background.
Does anybody know other interesting open-source search engines?

On Tue, Jul 7, 2009 at 00:39, John Wang john.w...@gmail.com wrote:
 Vik did a very nice job.
 One thing the experiment did not mention is that Lucene handles incremental
 updates, whereas many of the other competitors do not. So the indexing
 performance comparison is not really fair.
 -John

 On Mon, Jul 6, 2009 at 8:06 AM, Sean Owen sro...@gmail.com wrote:


 http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/

 I imagine many of you already saw this -- Lucene does pretty well in
 this shootout.
 The only area it tended to lag, it seems, is memory usage and speed in
 some cases.

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

2009-07-02 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726571#action_12726571
 ] 

Earwin Burrfoot commented on LUCENE-1488:
-

bq. There is no morphological processing or any other language-specific 
functionality in this patch... 
I'm speaking of the stemming in ArabicAnalyzer. Why can't you use its stemming 
TokenFilter on top of all the ICU goodness from this patch? Everything else 
ArabicAnalyzer consists of might as well be deleted right after.

 issues with standardanalyzer on multilingual text
 -

 Key: LUCENE-1488
 URL: https://issues.apache.org/jira/browse/LUCENE-1488
 Project: Lucene - Java
  Issue Type: Wish
  Components: contrib/analyzers
Reporter: Robert Muir
Priority: Minor
 Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.txt, 
 LUCENE-1488.txt


 The standard analyzer in Lucene is not exactly Unicode-friendly with regard 
 to breaking text into words, especially with respect to non-alphabetic 
 scripts.  This is because it is unaware of Unicode bounds properties.
 I actually couldn't figure out how the Thai analyzer could possibly be 
 working until I looked at the jflex rules and saw that the codepoint range for 
 most of the Thai block was added to the alphanum specification. Defining the 
 exact codepoint ranges like this for every language could help with the 
 problem, but you'd basically be reimplementing the bounds properties already 
 stated in the Unicode standard. 
 In general it looks like this kind of behavior is bad in Lucene even for 
 Latin; for instance, the analyzer will break words around accent marks in 
 decomposed form. While most Latin letter + accent combinations have composed 
 forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, 
 I suppose.) 
 I've got a partially tested standard analyzer that uses the ICU rule-based 
 BreakIterator instead of jflex. Using this method you can define word 
 boundaries according to the Unicode bounds properties. After getting it into 
 some good shape I'd be happy to contribute it for contrib, but I wonder if 
 there's a better solution so that out-of-the-box Lucene will be more friendly 
 to non-ASCII text. Unfortunately it seems jflex does not support the use of 
 these properties, such as [\p{Word_Break = Extend}], so this is probably the 
 major barrier.
 Thanks,
 Robert
 Thanks,
 Robert

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Improving TimeLimitedCollector

2009-06-27 Thread Earwin Burrfoot
Why don't you use Thread.interrupt() / .isInterrupted()?
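Roughly like this - a sketch only, with made-up names, not the prototype from this thread:

import java.util.Timer;
import java.util.TimerTask;

// Hypothetical sketch: a daemon Timer interrupts the worker thread once its
// budget elapses; hot loops then poll the interrupt flag instead of
// checking thread-locals.
class InterruptWatchdog {
  static Timer arm(final Thread worker, long timeoutMs) {
    Timer timer = new Timer(true); // daemon, never blocks JVM exit
    timer.schedule(new TimerTask() {
      public void run() { worker.interrupt(); }
    }, timeoutMs);
    return timer; // caller cancels it in a finally block on normal completion
  }
}

// Inside reader/scorer loops:
//   if (Thread.currentThread().isInterrupted())
//     throw new RuntimeException("time limit exceeded");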

On Sat, Jun 27, 2009 at 16:16, Shai Erera ser...@gmail.com wrote:
 A downside of breaking it out into static methods like this is that a
 thread cannot run >1 time-limited activity simultaneously but I guess that
 might be a reasonable restriction.

 I'm not sure I understand that - how can a thread run >1 activity
 simultaneously anyway, and how does your impl in TimeLimitingIndexReader allow
 it to do so? You use the thread as a key to the map. Am I missing something?

 Anyway, I think we can let go of the static methods and make them instance
 methods. I think .. if I want to use time limited activities, I should
 create a TimeLimitedThreadActivity instance and pass it around, to
 TimeLimitingIndexReader (and maybe in the future to a similar **IndexWriter)
 and any other custom code I have which I want to put a time limit on.

 A static class has the advantage of not needing to pass it around
 everywhere, and is accessible from everywhere, so that if we discover that
 limiting on IndexReader is not enough, and we want some of the scorers to
 check more frequently if they should stop, we won't need to pass that
 instance all the way down to them.

 I don't mind keeping it static, but I also don't mind if it will be an
 instance passed around, since currently it's only passed to IndexReader.

 Are you going to open an issue for that? Seems like a nice addition to me.
 Do you think it should belong in core or contrib? If 'core', then, if
 possible, I'd like to see all timeout classes under one package, including
 TimeLimitingCollector (which until 2.9 we can safely move around as we
 want).
 I don't mind working on that w/ you, if you want.

 Shai

 On Sat, Jun 27, 2009 at 2:24 PM, Mark Harwood markharw...@yahoo.co.uk
 wrote:

 Thanks for the feedback, Shai.
 So I guess you're suggesting breaking this out into a general utility
 class, e.g. something like:

 class TimeLimitedThreadActivity
 {
         // called by the client
         public static void startTimeLimitedActivity(long maxTimePermitted);
         public static void endTimeLimitedActivity();

         // called by resources (readers/writers) that need to be shared
         // fairly by threads
         public static void checkActivityNotElapsed(); // throws some form of
                                                       // runtime exception
 }
 A downside of breaking it out into static methods like this is that a
 thread cannot run >1 time-limited activity simultaneously but I guess that
 might be a reasonable restriction.

 Aside, how about using a PQ for the threads' times, or a TreeMap? That
 will save looping over the collection to find the next candidate. Just an
 implementation detail though.
 Yep, that was one of the rough edges - I just wanted to get raw timings
 first for all the "is timed out?" checks we're injecting into reader
 calls.
 Cheers
 Mark

 On 27 Jun 2009, at 11:37, Shai Erera wrote:

 I like the overall approach. However, it's very local to an IndexReader.
 I.e., if someone wanted to limit other operations (say, indexing), or doesn't
 use an IndexReader (for a Scorer impl, maybe), they cannot reuse it.

 What if we factor out the timeout logic to a Timeout class (I think it can
 be a static class, with the way you implemented it) and use it in
 TimeLimitingIndexReader? That class can offer a method check() which will do
 the internal logic (the 'if' check and throw exception). It will be similar
 to the current ensureOpen() followed by an operation.

 It might be considered more expensive since it won't check a boolean, but
 instead call a check() method, but it will be more reusable. Also,
 ensureOpen today is also a method call, so I don't think Timeout.check() is
 that bad. We can even later create a TimeLimitingIndexWriter and document
 Timeout class for other usage by external code.

 Aside, how about using a PQ for the threads' times, or a TreeMap? That
 will save looping over the collection to find the next candidate. Just an
 implementation detail though.

 Shai

 On Sat, Jun 27, 2009 at 3:31 AM, Mark Harwood markharw...@yahoo.co.uk
 wrote:

 Going back to my post re TimeLimitedIndexReaders - here's an incomplete
 but functional prototype:
 http://www.inperspective.com/lucene/TimeLimitedIndexReader.java
 http://www.inperspective.com/lucene/TestTimeLimitedIndexReader.java

 The principle is that all reader accesses check a volatile variable
 indicating something may have timed out (no need to check thread-locals
 etc.). If and only if a timeout has been noted are thread-locals checked to
 see which thread should throw a timeout exception.
 All time-limited use of the reader must be wrapped in try...finally calls to
 indicate the start and stop of a timed set of activities. A background
 thread maintains the next anticipated timeout deadline and simply waits
 until this is reached or the list of planned activities changes with new
 deadlines.
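 In condensed form, the scheme is roughly this (a sketch with invented names,
 not the linked prototype):

 // Sketch only. The volatile flag keeps the common case to one cheap read;
 // thread-locals are consulted only after some deadline has actually fired.
 class TimeoutSketch {
   static volatile boolean anyDeadlinePassed = false;      // set by background thread
   static final ThreadLocal deadlines = new ThreadLocal(); // per-thread deadline (Long)

   static void check() {
     if (anyDeadlinePassed) {              // done on every reader access
       Long d = (Long) deadlines.get();    // only now touch thread-locals
       if (d != null && System.currentTimeMillis() > d.longValue())
         throw new RuntimeException("activity time limit exceeded");
     }
   }
 }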

 Performance seems reasonable on my Wikipedia index:
 //some tests for heavy use of termenum/term 

[jira] Commented: (LUCENE-1342) 64bit JVM crashes on Linux

2009-06-26 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724441#action_12724441
 ] 

Earwin Burrfoot commented on LUCENE-1342:
-

bq. Sun can't ignore a HotSpot compiler bug, can they? 
They are safely ignoring CMS collector bugs on 64-bit archs.

 64bit JVM crashes on Linux
 --

 Key: LUCENE-1342
 URL: https://issues.apache.org/jira/browse/LUCENE-1342
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.0.0
 Environment: 2.6.18-53.el5 x86_64  GNU/Linux
 Java(TM) SE Runtime Environment (build 1.6.0_04-b12)
Reporter: Kevin Richards
 Attachments: hs_err_pid10565.log, hs_err_pid21301.log, 
 hs_err_pid27882.log


 Whilst running Lucene in our QA environment we received the following 
 exception. This problem was also reported here: 
 http://confluence.atlassian.com/display/KB/JSP-20240+-+POSSIBLE+64+bit+JDK+1.6+update+4+may+have+HotSpot+problems.
 Is this a JVM problem or a problem in Lucene?
 #
 # An unexpected error has been detected by Java Runtime Environment:
 #
 #  SIGSEGV (0xb) at pc=0x2adb9e3f, pid=2275, tid=1085356352
 #
 # Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b19 mixed mode linux-amd64)
 # Problematic frame:
 # V  [libjvm.so+0x1fce3f]
 #
 # If you would like to submit a bug report, please visit:
 #   http://java.sun.com/webapps/bugreport/crash.jsp
 #
 ---  T H R E A D  ---
 Current thread (0x2aab0007f000):  JavaThread CompilerThread0 daemon 
 [_thread_in_vm, id=2301, stack(0x40a13000,0x40b14000)]
 siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), 
 si_addr=0x
 Registers:
 RAX=0x, RBX=0x2aab0007f000, RCX=0x, 
 RDX=0x2aab00309aa0
 RSP=0x40b10f60, RBP=0x40b10fb0, RSI=0x2aaab37d1ce8, 
 RDI=0x2aaad000
 R8 =0x2b40cd88, R9 =0x0ffc, R10=0x2b40cd90, 
 R11=0x2b410810
 R12=0x2aab00ae60b0, R13=0x2aab0a19cc30, R14=0x40b112f0, 
 R15=0x2aab00ae60b0
 RIP=0x2adb9e3f, EFL=0x00010246, CSGSFS=0x0033, 
 ERR=0x0004
   TRAPNO=0x000e
 Top of Stack: (sp=0x40b10f60)
 0x40b10f60:   2aab0007f000 
 0x40b10f70:   2aab0a19cc30 0001
 0x40b10f80:   2aab0007f000 
 0x40b10f90:   40b10fe0 2aab0a19cc30
 0x40b10fa0:   2aab0a19cc30 2aab00ae60b0
 0x40b10fb0:   40b10fe0 2ae9c2e4
 0x40b10fc0:   2b413210 2b413350
 0x40b10fd0:   40b112f0 2aab09796260
 0x40b10fe0:   40b110e0 2ae9d7d8
 0x40b10ff0:   2b40f3d0 2aab08c2a4c8
 0x40b11000:   40b11940 2aab09796260
 0x40b11010:   2aab09795b28 
 0x40b11020:   2aab08c2a4c8 2aab009b9750
 0x40b11030:   2aab09796260 40b11940
 0x40b11040:   2b40f3d0 2023
 0x40b11050:   40b11940 2aab09796260
 0x40b11060:   40b11090 2b0f199e
 0x40b11070:   40b11978 2aab08c2a458
 0x40b11080:   2b413210 2023
 0x40b11090:   40b110e0 2b0f1fcf
 0x40b110a0:   2023 2aab09796260
 0x40b110b0:   2aab08c2a3c8 40b123b0
 0x40b110c0:   2aab08c2a458 40b112f0
 0x40b110d0:   2b40f3d0 2aab00043670
 0x40b110e0:   40b11160 2b0e808d
 0x40b110f0:   2aab000417c0 2aab009b66a8
 0x40b11100:    2aab009b9750
 0x40b0:   40b112f0 2aab009bb360
 0x40b11120:   0003 40b113d0
 0x40b11130:   01002aab0052d0c0 40b113d0
 0x40b11140:   00b3 40b112f0
 0x40b11150:   40b113d0 2aab08c2a108 
 Instructions: (pc=0x2adb9e3f)
 0x2adb9e2f:   48 89 5d b0 49 8b 55 08 49 8b 4c 24 08 48 8b 32
 0x2adb9e3f:   4c 8b 21 8b 4e 1c 49 8d 7c 24 10 89 cb 4a 39 34 
 Stack: [0x40a13000,0x40b14000],  sp=0x40b10f60,  free 
 space=1015k
 Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
 code)
 V  [libjvm.so+0x1fce3f]
 V  [libjvm.so+0x2df2e4]
 V  [libjvm.so+0x2e07d8]
 V  [libjvm.so+0x52b08d]
 V  [libjvm.so+0x524914]
 V  [libjvm.so+0x51c0ea]
 V  [libjvm.so+0x519f77]
 V  [libjvm.so+0x519e7c]
 V  [libjvm.so+0x519ad5]
 V  [libjvm.so+0x1e0cf4]
 V  [libjvm.so+0x2a0bc0]
 V  [libjvm.so+0x528e03]
 V  [libjvm.so+0x51c0ea]
 V  [libjvm.so+0x519f77]
 V  [libjvm.so+0x519e7c]
 V  [libjvm.so+0x519ad5]
 V

Re: Improving TimeLimitedCollector

2009-06-24 Thread Earwin Burrfoot
Having scorers check timeouts while advancing will definitely increase
the frequency of said timeouts.

On Wed, Jun 24, 2009 at 13:13, eks dev eks...@yahoo.co.uk wrote:
 Re: I think such a parameter should not exist on individual search methods
 since it's more of a global setting (i.e., I want my searches to be limited
 to 5 seconds, always, not just for a particular query). Right?

 I am not sure about this one; we had cases where one physical index served
 two logical indices with different requirements for clients. Having Timeout
 settable per Query is nice to have.

 At the end of the day, with such a timeout you support Quality/Time compromise
 settings:
 if you need all results, be ready to wait longer and set a longer timeout;
 if you need SOME results quickly, then reduce this timeout.

 That should ideally be the user's decision.

 
 From: Shai Erera ser...@gmail.com
 To: java-dev@lucene.apache.org
 Sent: Wednesday, 24 June, 2009 10:55:50
 Subject: Re: Improving TimeLimitedCollector

 But TimeLimitingCollector's logic is coded in its collect() method. The top
 scorer calls nextDoc() or advance() on all its sub-scorers, and only when a
 match is found does it call collect().

 If we want the sub-scorers to check whether they should abort, we'd need to
 revamp (liked the word :)) TimeLimitingCollector to be something like the
 CheckAbort that SegmentMerger uses. I.e., the top scorer will pass such an
 instance to its sub-scorers, which will call a TimeLimit.check() or
 something, and if the time limit has expired this call will throw a
 TimeExceededException (like TLC).
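 A rough sketch of such a check object - a guess at the shape, modeled on
 SegmentMerger's CheckAbort; none of this is existing API:

 class TimeLimit {
   static class TimeExceededException extends RuntimeException {}

   private final long deadline;

   TimeLimit(long budgetMs) {
     this.deadline = System.currentTimeMillis() + budgetMs;
   }

   // called by sub-scorers from nextDoc()/advance()
   void check() {
     if (System.currentTimeMillis() > deadline)
       throw new TimeExceededException();
   }
 }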

 We can enable this by adding another parameter to IndexSearcher saying whether
 searches should be limited by time, and what the time limit is. It will then
 instantiate that object and pass it to its Scorer and so on. I think such a
 parameter should not exist on individual search methods since it's more of a
 global setting (i.e., I want my searches to be limited to 5 seconds, always,
 not just for a particular query). Right?

 Another option would be to add a setTimeout method on Query, which will use
 it when it constructs its Scorer. The shortcoming of this is that if I want
 to use someone else's query which did not implement setTimeout, then I'll
 need to build a TimeOutQueryWrapper that will wrap a Query and implement
 the timeout logic, but that gets complicated.

 I think the Collector approach makes the most sense to me, since it's the
 only object I fully control in the search process. I cannot control Query
 implementations, and I cannot control the decisions made by IndexSearcher.
 But I can always wrap someone else's Collector with TLC and pass it to
 search().

 Shai

 On Wed, Jun 24, 2009 at 12:26 AM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:

 As we're revamping collectors, weights, and scorers, perhaps we
 can push time limiting into the individual subscorers? Currently
 on a boolean query, we're timing out the query at the top level
 which doesn't work well if the subqueries exceed the time limit.






-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1712) Set default precisionStep for NumericField and NumericRangeFilter

2009-06-23 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722996#action_12722996
 ] 

Earwin Burrfoot commented on LUCENE-1712:
-

Having half of your methods constantly fail with an exception depending on a 
constructor parameter - that just screams "Split me into two classes!"
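In other words, something along these lines (purely illustrative; these classes are made up, not proposed API):

// Illustrative sketch only. One class per base type, so no setXxxValue()
// can ever throw for the "wrong" type chosen in the constructor.
class IntNumericField /* extends AbstractField */ {
  private int value;
  public void setIntValue(int v) { this.value = v; }
}

class LongNumericField /* extends AbstractField */ {
  private long value;
  public void setLongValue(long v) { this.value = v; }
}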

 Set default precisionStep for NumericField and NumericRangeFilter
 -

 Key: LUCENE-1712
 URL: https://issues.apache.org/jira/browse/LUCENE-1712
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9
Reporter: Michael McCandless
Priority: Minor
 Fix For: 2.9


 This is a spinoff from LUCENE-1701.
 A user using Numeric* should not need to understand what's
 under the hood in order to do their indexing & searching.
 They should be able to simply:
 {code}
 doc.add(new NumericField("price", 15.50));
 {code}
 And have a decent default precisionStep selected for them.
 Actually, if we add ctors to NumericField for each of the supported
 types (so the above code works), we can set the default per-type.  I
 think we should do that?
 4 for int and 6 for long was proposed as good defaults.
 The default need not be perfect, as advanced users can always
 optimize their precisionStep, and for users experiencing slow
 RangeQuery performance, NumericRangeQuery with any of the defaults we
 are discussing will be much faster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1715) DirectoryIndexReader finalize() holding TermInfosReader longer than necessary

2009-06-23 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723224#action_12723224
 ] 

Earwin Burrfoot commented on LUCENE-1715:
-

I object to nulling references in an attempt to speed up GC. It's totally useless on 
any decent JVM implementation, and if someone uses an indecent JVM, I doubt they're 
concerned with their app's efficiency.

 DirectoryIndexReader finalize() holding TermInfosReader longer than necessary
 -

 Key: LUCENE-1715
 URL: https://issues.apache.org/jira/browse/LUCENE-1715
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4.1
 Environment: Sun JDK 6 update 12 64-bit, Debian Lenny
Reporter: Brian Groose
Assignee: Michael McCandless
 Fix For: 2.9


 DirectoryIndexReader has a finalize method, which causes the JDK to keep a 
 reference to the object until it can be finalized.  SegmentReader and 
 MultiSegmentReader are subclasses that contain references to, potentially, 
 hundreds of megabytes of cached data in a TermInfosReader.
 Some options would be removing finalize() from DirectoryIndexReader (it 
 releases a write lock at the moment) or possibly nulling out references in 
 various close() and doClose() methods throughout the class hierarchy so that 
 the finalizable object doesn't reference the Term arrays.
 Original mailing list message:
 http://mail-archives.apache.org/mod_mbox/lucene-java-user/200906.mbox/%3c7a5cb4a7bbce0c40b81c5145c326c31301a62...@numevp06.na.imtn.com%3e

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1715) DirectoryIndexReader finalize() holding TermInfosReader longer than necessary

2009-06-23 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723225#action_12723225
 ] 

Earwin Burrfoot commented on LUCENE-1715:
-

And I support removing finalizers everywhere if their only point is to guard 
against a forgotten close().

 DirectoryIndexReader finalize() holding TermInfosReader longer than necessary
 -

 Key: LUCENE-1715
 URL: https://issues.apache.org/jira/browse/LUCENE-1715
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4.1
 Environment: Sun JDK 6 update 12 64-bit, Debian Lenny
Reporter: Brian Groose
Assignee: Michael McCandless
 Fix For: 2.9


 DirectoryIndexReader has a finalize method, which causes the JDK to keep a 
 reference to the object until it can be finalized.  SegmentReader and 
 MultiSegmentReader are subclasses that contain references to, potentially, 
 hundreds of megabytes of cached data in a TermInfosReader.
 Some options would be removing finalize() from DirectoryIndexReader (it 
 releases a write lock at the moment) or possibly nulling out references in 
 various close() and doClose() methods throughout the class hierarchy so that 
 the finalizable object doesn't reference the Term arrays.
 Original mailing list message:
 http://mail-archives.apache.org/mod_mbox/lucene-java-user/200906.mbox/%3c7a5cb4a7bbce0c40b81c5145c326c31301a62...@numevp06.na.imtn.com%3e

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1715) DirectoryIndexReader finalize() holding TermInfosReader longer than necessary

2009-06-23 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723289#action_12723289
 ] 

Earwin Burrfoot commented on LUCENE-1715:
-

There's in fact one case where nulling harms. I'm going to try making as much 
of IR as possible immutable and final. Load everything upfront on 
creation/reopen (or don't load it if the IR is created for, say, merging). Unlike 
nulling references, making frequently accessed fields final does have an impact 
under adequate JVMs.

Well, nulling can be added now and removed when/if I finish my IR stuff.

 DirectoryIndexReader finalize() holding TermInfosReader longer than necessary
 -

 Key: LUCENE-1715
 URL: https://issues.apache.org/jira/browse/LUCENE-1715
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4.1
 Environment: Sun JDK 6 update 12 64-bit, Debian Lenny
Reporter: Brian Groose
Assignee: Michael McCandless
 Fix For: 2.9


 DirectoryIndexReader has a finalize method, which causes the JDK to keep a 
 reference to the object until it can be finalized.  SegmentReader and 
 MultiSegmentReader are subclasses that contain references to, potentially, 
 hundreds of megabytes of cached data in a TermInfosReader.
 Some options would be removing finalize() from DirectoryIndexReader (it 
 releases a write lock at the moment) or possibly nulling out references in 
 various close() and doClose() methods throughout the class hierarchy so that 
 the finalizable object doesn't reference the Term arrays.
 Original mailing list message:
 http://mail-archives.apache.org/mod_mbox/lucene-java-user/200906.mbox/%3c7a5cb4a7bbce0c40b81c5145c326c31301a62...@numevp06.na.imtn.com%3e

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1607) String.intern() faster alternative

2009-06-23 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723352#action_12723352
 ] 

Earwin Burrfoot commented on LUCENE-1607:
-

Okay, let's have an extra class and the ability to switch impls. I liked that the 
static method could get inlined (at least its short path), but that's not 
necessary.

Except I'd like the javadoc to demand that each impl be String.intern()-compatible. 
There's nothing bad in that, as in any decent impl a unique string will be 
String.intern()'ed at most once. And the case where you get an infinite flow 
of unique strings is degenerate anyway - you have to fix something, not deal 
with it. On the other hand, we can remove the "This should never be changed after 
other Lucene APIs have been used" clause.

Rewrite the 'for' as 'for (Entry e = first; e != null; e = e.next)' for clarity?
'Entry[] arr = cache;' - can this be skipped? 'cache' is already final and the 
optimizer loves finals. Plus further down the method you use both cache[slot] 
and arr[slot]. Or am I missing some voodoo?
Can the 'if' check around 'nextToLast = e' also be removed?
'public String intern(char[] arr, int offset, int len)' - is this needed?
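For readers following along, the general shape of such an intern cache is roughly this (my own condensed sketch - the actual patch differs in details like the 'nextToLast' bookkeeping mentioned above; sizing, eviction and concurrency are simplified):

// Condensed sketch of a String.intern()-compatible lookup cache; not the patch.
class InternCacheSketch {
  private static class Entry {
    final String str; final int hash; final Entry next;
    Entry(String str, int hash, Entry next) {
      this.str = str; this.hash = hash; this.next = next;
    }
  }

  private final Entry[] cache = new Entry[1 << 16]; // power of two for masking

  public String intern(String s) {
    final int h = s.hashCode();
    final int slot = h & (cache.length - 1);
    for (Entry e = cache[slot]; e != null; e = e.next) {
      if (e.hash == h && e.str.equals(s)) return e.str; // fast path, no JVM pool
    }
    // Miss: delegate to the JVM pool so results stay String.intern()-compatible,
    // then remember the canonical instance. A unique string hits the slow JVM
    // intern at most once.
    String interned = s.intern();
    cache[slot] = new Entry(interned, h, cache[slot]);
    return interned;
  }
}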

 String.intern() faster alternative
 --

 Key: LUCENE-1607
 URL: https://issues.apache.org/jira/browse/LUCENE-1607
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Earwin Burrfoot
Assignee: Yonik Seeley
 Fix For: 2.9

 Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
 LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
 LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch


 By using our own interned string pool on top of the default one, String.intern() 
 can be greatly optimized.
 On my setup (Java 6) this alternative runs ~15.8x faster for already interned 
 strings, and ~2.2x faster for 'new String(interned)'.
 For Java 5 and 4 the speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations

2009-06-23 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723355#action_12723355
 ] 

Earwin Burrfoot commented on LUCENE-1677:
-

Mike, are we going to postpone the actual deletion of these classes until 3.0?

 Remove GCJ IndexReader specializations
 --

 Key: LUCENE-1677
 URL: https://issues.apache.org/jira/browse/LUCENE-1677
 Project: Lucene - Java
  Issue Type: Task
Reporter: Earwin Burrfoot
Assignee: Michael McCandless
 Fix For: 2.9


 These specializations are outdated, unsupported, most probably pointless due 
 to the speed of modern JVMs and, I bet, nobody uses them (Mike, you said you 
 were going to ask people on java-user - did anybody reply that they need it?). 
 While giving nothing, they make the SegmentReader instantiation code look really 
 ugly.
 If nobody objects, I'm going to post a patch that removes these from Lucene.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations

2009-06-23 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723378#action_12723378
 ] 

Earwin Burrfoot commented on LUCENE-1677:
-

I thought we were doing everything right now, as it is broken already.
And I have a half-written patch with the SR cleanup after GCJ removal :)


 Remove GCJ IndexReader specializations
 --

 Key: LUCENE-1677
 URL: https://issues.apache.org/jira/browse/LUCENE-1677
 Project: Lucene - Java
  Issue Type: Task
Reporter: Earwin Burrfoot
Assignee: Michael McCandless
 Fix For: 2.9


 These specializations are outdated, unsupported, most probably pointless due 
 to the speed of modern JVMs and, I bet, nobody uses them (Mike, you said you 
 were going to ask people on java-user - did anybody reply that they need it?). 
 While giving nothing, they make the SegmentReader instantiation code look really 
 ugly.
 If nobody objects, I'm going to post a patch that removes these from Lucene.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache

2009-06-22 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722769#action_12722769
 ] 

Earwin Burrfoot commented on LUCENE-1701:
-

Using 4 for int, 6 for long. Dates-as-longs look a bit sad on 8.

 Add NumericField and NumericSortField, make plain text numeric parsers public 
 in FieldCache, move trie parsers to FieldCache
 

 Key: LUCENE-1701
 URL: https://issues.apache.org/jira/browse/LUCENE-1701
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index, Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1701-test-tag-special.patch, LUCENE-1701.patch, 
 LUCENE-1701.patch, LUCENE-1701.patch, LUCENE-1701.patch, LUCENE-1701.patch, 
 LUCENE-1701.patch, NumericField.java


 In discussions about LUCENE-1673, Mike & me wanted to add a new NumericField 
 to o.a.l.document specific for easy indexing. An alternative would be to add 
 a NumericUtils.newXxxField() factory, that creates a preconfigured Field 
 instance with norms and tf off, optionally a stored text (LUCENE-1699) and 
 the TokenStream already initialized. On the other hand 
 NumericUtils.newXxxSortField could be moved to NumericSortField.
 I and Yonik tend to use the factory for both, Mike tends to create the new 
 classes.
 Also the parsers for string-formatted numerics are not public in FieldCache. 
 As the new SortField API (LUCENE-1478) makes it possible to support a parser 
 in SortField instantiation, it would be good to have the static parsers in 
 FieldCache publicly available. SortField would init its member variable to them 
 (instead of NULL), making the code a lot easier (FieldComparator has these ugly 
 null checks when retrieving values from the cache).
 Moving the Trie parsers also as static instances into FieldCache would make 
 the code cleaner and we would be able to hide the hack 
 StopFillCacheException by making it private to FieldCache (currently its 
 public because NumericUtils is in o.a.l.util).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache

2009-06-22 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722769#action_12722769
 ] 

Earwin Burrfoot edited comment on LUCENE-1701 at 6/22/09 12:18 PM:
---

Using 4 for int, 6 for long. Dates-as-longs look a bit sad on 8.

Though, if you want really fast dates, choosing hour/day/month/year as precision 
steps is vastly superior, plus it also clicks well with user-selected ranges. 
Still, I dumped this approach for uniformity and clarity.

  was (Author: earwin):
Using 4 for int, 6 for long. Dates-as-longs look a bit sad on 8.
  
 Add NumericField and NumericSortField, make plain text numeric parsers public 
 in FieldCache, move trie parsers to FieldCache
 

 Key: LUCENE-1701
 URL: https://issues.apache.org/jira/browse/LUCENE-1701
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index, Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1701-test-tag-special.patch, LUCENE-1701.patch, 
 LUCENE-1701.patch, LUCENE-1701.patch, LUCENE-1701.patch, LUCENE-1701.patch, 
 LUCENE-1701.patch, NumericField.java


 In discussions about LUCENE-1673, Mike & me wanted to add a new NumericField 
 to o.a.l.document specific for easy indexing. An alternative would be to add 
 a NumericUtils.newXxxField() factory, that creates a preconfigured Field 
 instance with norms and tf off, optionally a stored text (LUCENE-1699) and 
 the TokenStream already initialized. On the other hand 
 NumericUtils.newXxxSortField could be moved to NumericSortField.
 I and Yonik tend to use the factory for both, Mike tends to create the new 
 classes.
 Also the parsers for string-formatted numerics are not public in FieldCache. 
 As the new SortField API (LUCENE-1478) makes it possible to support a parser 
 in SortField instantiation, it would be good to have the static parsers in 
 FieldCache publicly available. SortField would init its member variable to them 
 (instead of NULL), making the code a lot easier (FieldComparator has these ugly 
 null checks when retrieving values from the cache).
 Moving the Trie parsers also as static instances into FieldCache would make 
 the code cleaner and we would be able to hide the hack 
 StopFillCacheException by making it private to FieldCache (currently its 
 public because NumericUtils is in o.a.l.util).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache

2009-06-22 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722775#action_12722775
 ] 

Earwin Burrfoot commented on LUCENE-1701:
-

 Design for today.
 And spend two years deprecating and supporting today's designs after you get 
 a better thing tomorrow. Lucene-style back-compat and agile design aren't 
 things that marry well.
 donating something to Lucene means casting it in concrete.
 We can't let fear of back-compat prevent us from making progress.
My point was that strict back-compat prevents people from donating work which 
is not yet finalized. They either lose the comfortable volatility of private 
code, or have to maintain two versions of it - private and Lucene's.

 NRT seems to tread the same path, and I'm not sure it's going to win that 
 much turnaround time after newly-introduced per-segment collection.
 I agree, per-segment collection was the bulk of the gains needed for
 NRT. This was a big change and a huge step forward in simple reopen
 turnaround.
I vote it the most frustrating (in terms of adapting your custom code) and 
most useful change of 2.9 :)

 But, not having to write & read deletes to disk, not commit (fsync)
 from writer in order to see those changes in reader should also give
 us decent gains. fsync is surprisingly and intermittently costly.
I'm not sure this can't be achieved without messing with IR/W guts so much. 
The guys from LinkedIn that drive this feature (if I'm not mistaken) had a 
prior solution with separate indexes, one on disk, one in RAM. Per-segment 
collection adds superfast reopens and a MultiReader that is way greater than 
MultiSearcher - you can finally do adequately fast searches across separate 
indexes. Do we still need to add complexity for minor performance gains?

 And this integration lets us take it a step further with LUCENE-1313,
 where recently created segments can remain in RAM and be shared with
 the reader.
RAMDirectory?

 Some time ago I finished a first version of IR plugins, and enjoy pretty low 
 reopen times (field/facet/filter cache warmups included). (Yes, I'm going to 
 open an issue for plugins once they stabilize enough)
 I'm confused: I thought that effort was to make SegmentReader's
 components fully pluggable? (Not to actually change what components
 SegmentReader is creating). EG does this modularization alter the
 approach to NRT? I thought they were orthogonal.
Yes, they are orthogonal. This was yet another praise of per-segment collection 
and an example of how this approach can be extended to your custom stuff (like 
a filter cache).


 Add NumericField and NumericSortField, make plain text numeric parsers public 
 in FieldCache, move trie parsers to FieldCache
 

 Key: LUCENE-1701
 URL: https://issues.apache.org/jira/browse/LUCENE-1701
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index, Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1701-test-tag-special.patch, LUCENE-1701.patch, 
 LUCENE-1701.patch, LUCENE-1701.patch, LUCENE-1701.patch, LUCENE-1701.patch, 
 LUCENE-1701.patch, NumericField.java


 In discussions about LUCENE-1673, Mike & me wanted to add a new NumericField 
 to o.a.l.document specific for easy indexing. An alternative would be to add 
 a NumericUtils.newXxxField() factory, that creates a preconfigured Field 
 instance with norms and tf off, optionally a stored text (LUCENE-1699) and 
 the TokenStream already initialized. On the other hand 
 NumericUtils.newXxxSortField could be moved to NumericSortField.
 I and Yonik tend to use the factory for both, Mike tends to create the new 
 classes.
 Also the parsers for string-formatted numerics are not public in FieldCache. 
 As the new SortField API (LUCENE-1478) makes it possible to support a parser 
 in SortField instantiation, it would be good to have the static parsers in 
 FieldCache publicly available. SortField would init its member variable to them 
 (instead of NULL), making the code a lot easier (FieldComparator has these ugly 
 null checks when retrieving values from the cache).
 Moving the Trie parsers also as static instances into FieldCache would make 
 the code cleaner and we would be able to hide the hack 
 StopFillCacheException by making it private to FieldCache (currently its 
 public because NumericUtils is in o.a.l.util).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Re: Shouldn't IndexWriter.commit(Map) accept Properties instead?

2009-06-22 Thread Earwin Burrfoot
 What other issues would we be taking on by using Java's serialization here...?
It's insanely slow. Though, that doesn't apply to a once-per-commit call.

The other point is, if you store Object, you can no longer mix Lucene
and user data.
With Map<String, String>, or whatever approach, you could reserve some key space
for Lucene and let the user add his stuff on top.
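Something like this, say (the key names below are invented for illustration):

import java.util.HashMap;
import java.util.Map;

// Illustrative only; "lucene." as a reserved prefix is an assumption, not an
// existing convention. 'writer' is an open IndexWriter.
Map<String, String> userData = new HashMap<String, String>();
userData.put("lucene.someInternalKey", "42");      // reserved key space
userData.put("myapp.lastSyncTime", "1245682800");  // user data on top
writer.commit(userData);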

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1712) Set default precisionStep for NumericField and NumericRangeFilter

2009-06-22 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722843#action_12722843
 ] 

Earwin Burrfoot commented on LUCENE-1712:
-

Am I misunderstanding something, or does the problem still persist?
Even if you use a common default, what is your base type - int or long? Are 
floats converted to ints, or to longs?

 Set default precisionStep for NumericField and NumericRangeFilter
 -

 Key: LUCENE-1712
 URL: https://issues.apache.org/jira/browse/LUCENE-1712
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9
Reporter: Michael McCandless
Priority: Minor
 Fix For: 2.9


 This is a spinoff from LUCENE-1701.
 A user using Numeric* should not need to understand what's
 under the hood in order to do their indexing & searching.
 They should be able to simply:
 {code}
 doc.add(new NumericField("price", 15.50));
 {code}
 And have a decent default precisionStep selected for them.
 Actually, if we add ctors to NumericField for each of the supported
 types (so the above code works), we can set the default per-type.  I
 think we should do that?
 4 for int and 6 for long was proposed as good defaults.
 The default need not be perfect, as advanced users can always
 optimize their precisionStep, and for users experiencing slow
 RangeQuery performance, NumericRangeQuery with any of the defaults we
 are discussing will be much faster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1712) Set default precisionStep for NumericField and NumericRangeFilter

2009-06-22 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722851#action_12722851
 ] 

Earwin Burrfoot commented on LUCENE-1712:
-

Aha! And each time you invoke setFloatValue/setDoubleValue it switches the base 
type behind the scenes? Eeek.

 Set default precisionStep for NumericField and NumericRangeFilter
 -

 Key: LUCENE-1712
 URL: https://issues.apache.org/jira/browse/LUCENE-1712
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9
Reporter: Michael McCandless
Priority: Minor
 Fix For: 2.9


 This is a spinoff from LUCENE-1701.
 A user using Numeric* should not need to understand what's
 under the hood in order to do their indexing & searching.
 They should be able to simply:
 {code}
 doc.add(new NumericField("price", 15.50));
 {code}
 And have a decent default precisionStep selected for them.
 Actually, if we add ctors to NumericField for each of the supported
 types (so the above code works), we can set the default per-type.  I
 think we should do that?
 4 for int and 6 for long was proposed as good defaults.
 The default need not be perfect, as advanced users can always
 optimize their precisionStep, and for users experiencing slow
 RangeQuery performance, NumericRangeQuery with any of the defaults we
 are discussing will be much faster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: 3MB lucene-analyzers.jar?

2009-06-21 Thread Earwin Burrfoot
 But: I do not understand the problems with this JAR file. If somebody really
 wants to have smaller files, one could use some tools that do it
 automatically based on class usage.
 I personally have a couple of use cases for that, as I have to work in
 very limited environments. Imagine embedded systems or mobile phones
 where 500 KB is a lot. If you really need the analyzer you can include
 the additional jar.
Jar Jar Links - special tools for special tasks.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache

2009-06-19 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721787#action_12721787
 ] 

Earwin Burrfoot commented on LUCENE-1701:
-

I vote for factories - escaping back-compat woes by exposing a minimal interface.

 Add NumericField and NumericSortField, make plain text numeric parsers public 
 in FieldCache, move trie parsers to FieldCache
 

 Key: LUCENE-1701
 URL: https://issues.apache.org/jira/browse/LUCENE-1701
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index, Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9


 In discussions about LUCENE-1673, Mike  me wanted to add a new NumericField 
 to o.a.l.document specific for easy indexing. An alternative would be to add 
 a NumericUtils.newXxxField() factory, that creates a preconfigured Field 
 instance with norms and tf off, optionally a stored text (LUCENE-1699) and 
 the TokenStream already initialized. On the other hand 
 NumericUtils.newXxxSortField could be moved to NumericSortField.
 I and Yonik tend to use the factory for both, Mike tends to create the new 
 classes.
 Also the parsers for string-formatted numerics are not public in FieldCache. 
 As the new SortField API (LUCENE-1478) makes it possible to support a parser 
 in SortField instantiation, it would be good to have the static parsers in 
 FieldCache public available. SortField would init its member variable to them 
 (instead of NULL), so making code a lot easier (FieldComparator has this ugly 
 null checks when retrieving values from the cache).
 Moving the Trie parsers also as static instances into FieldCache would make 
 the code cleaner and we would be able to hide the hack 
 StopFillCacheException by making it private to FieldCache (currently its 
 public because NumericUtils is in o.a.l.util).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache

2009-06-19 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721830#action_12721830
 ] 

Earwin Burrfoot commented on LUCENE-1701:
-

Mike, I very much agree with everything you said, except "factory is less 
consumable than constructor" and "add stuff to the index to handle NumericField".

Out of your three examples the second one is bad, no questions. But the first and 
last are absolutely equal in terms of consumability.
Static factories are cool (they allow switching implementations and 
instantiation logic without changing the API) and are as easy to use (probably 
even easier with generics in Java 5) as constructors.

If we add some generic storable flags for Lucene fields, this is cool 
(probably); NumericField can then capitalize on it, as well as users writing 
their own NNNFields.
Tying the index format to some particular implementation of numerics is bad 
design. Why on earth can't my own split-field (vs. single-field as in current 
Lucene) trie-encoded number enjoy the same benefits as NumericField from Lucene 
core?

bq. By this same logic, should we remove NumericRangeFilter/Query and use
static factories instead?
I do use factory methods for all my queries and filters, and it makes me feel 
warm and fuzzy! :) Under the hood some of them consult FieldInfo to instantiate 
custom-tailored query variants, so I just use range(CREATION_TIME, from, to) 
and don't think about whether this field is trie-encoded or raw.
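To make that concrete, such a factory might look like this (a sketch of my own code's style, not a Lucene API; the metadata lookup is elided):

// range() is a hypothetical helper, not Lucene API. For a trie-encoded
// field it could simply delegate to the core query:
public static Query range(String field, long from, long to) {
  // ...consult per-field metadata here and pick the right variant...
  return NumericRangeQuery.newLongRange(field, from, to, true, true);
}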

Simple things should be simple, okay. Complex things should be simple too, 
argh! :)

 Add NumericField and NumericSortField, make plain text numeric parsers public 
 in FieldCache, move trie parsers to FieldCache
 

 Key: LUCENE-1701
 URL: https://issues.apache.org/jira/browse/LUCENE-1701
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index, Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9


 In discussions about LUCENE-1673, Mike & me wanted to add a new NumericField 
 to o.a.l.document specific for easy indexing. An alternative would be to add 
 a NumericUtils.newXxxField() factory, that creates a preconfigured Field 
 instance with norms and tf off, optionally a stored text (LUCENE-1699) and 
 the TokenStream already initialized. On the other hand 
 NumericUtils.newXxxSortField could be moved to NumericSortField.
 I and Yonik tend to use the factory for both, Mike tends to create the new 
 classes.
 Also the parsers for string-formatted numerics are not public in FieldCache. 
 As the new SortField API (LUCENE-1478) makes it possible to support a parser 
 in SortField instantiation, it would be good to have the static parsers in 
 FieldCache publicly available. SortField would init its member variable to them 
 (instead of NULL), making the code a lot easier (FieldComparator has these ugly 
 null checks when retrieving values from the cache).
 Moving the Trie parsers also as static instances into FieldCache would make 
 the code cleaner and we would be able to hide the hack 
 StopFillCacheException by making it private to FieldCache (currently its 
 public because NumericUtils is in o.a.l.util).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache

2009-06-19 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721830#action_12721830
 ] 

Earwin Burrfoot edited comment on LUCENE-1701 at 6/19/09 8:50 AM:
--

Mike, I very much agree with everything you said, except "factory is less
consumable than constructor" and "add stuff to index to handle NumericField".

Out of your three examples the second one is bad, no questions. But the first
and last are absolutely equal in terms of consumability.
Static factories are cool (they allow switching implementations and
instantiation logic without changing the API) and are as easy to use as
constructors (probably even easier with generics in Java 5).

If we add some generic storable flags for Lucene fields, that would be cool
(probably); NumericField could then capitalize on it, as could users writing
their own NNNFields.
Tying the index format to one particular implementation of numerics is bad design.
Why on earth can't my own split-field (vs. single-field as in current Lucene)
trie-encoded number enjoy the same benefits as NumericField from Lucene core?

bq. By this same logic, should we remove NumericRangeFilter/Query and use 
static factories instead?
I do use factory methods for all my queries and filters, and it makes me feel
warm and fuzzy! :) Under the hood some of them consult FieldInfo to instantiate
custom-tailored query variants, so I just call range(CREATION_TIME, from, to)
and don't have to think about whether the field is trie-encoded or raw.

Simple things should be simple, okay. Complex things should be simple too, 
argh! :)

  was (Author: earwin):
Mike, I very much agree with everything you said, except "factory is less
consumable than constructor" and "add stuff to index to handle NumericField".

Out of your three examples the second one is bad, no questions. But the first
and last are absolutely equal in terms of consumability.
Static factories are cool (they allow switching implementations and
instantiation logic without changing the API) and are as easy to use as
constructors (probably even easier with generics in Java 5).

If we add some generic storable flags for Lucene fields, that would be cool
(probably); NumericField could then capitalize on it, as could users writing
their own NNNFields.
Tying the index format to one particular implementation of numerics is bad design.
Why on earth can't my own split-field (vs. single-field as in current Lucene)
trie-encoded number enjoy the same benefits as NumericField from Lucene core?

bq. By this same logic, should we remove NumericRangeFilter/Query and use
static factories instead?
I do use factory methods for all my queries and filters, and it makes me feel
warm and fuzzy! :) Under the hood some of them consult FieldInfo to instantiate
custom-tailored query variants, so I just call range(CREATION_TIME, from, to)
and don't have to think about whether the field is trie-encoded or raw.

Simple things should be simple, okay. Complex things should be simple too, 
argh! :)
  
 Add NumericField and NumericSortField, make plain text numeric parsers public 
 in FieldCache, move trie parsers to FieldCache
 

 Key: LUCENE-1701
 URL: https://issues.apache.org/jira/browse/LUCENE-1701
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index, Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9


 In discussions about LUCENE-1673, Mike and I wanted to add a new NumericField 
 to o.a.l.document, specifically for easy indexing. An alternative would be to add 
 a NumericUtils.newXxxField() factory that creates a preconfigured Field 
 instance with norms and tf off, optionally a stored text value (LUCENE-1699), and 
 the TokenStream already initialized. On the other hand, 
 NumericUtils.newXxxSortField could be moved to NumericSortField.
 Yonik and I tend to prefer the factory for both; Mike tends to create the new 
 classes.
 Also, the parsers for string-formatted numerics are not public in FieldCache. 
 As the new SortField API (LUCENE-1478) makes it possible to supply a parser 
 at SortField instantiation, it would be good to have the static parsers in 
 FieldCache publicly available. SortField would init its member variable to them 
 (instead of NULL), making the code a lot simpler (FieldComparator has these ugly 
 null checks when retrieving values from the cache).
 Moving the Trie parsers as static instances into FieldCache would also make 
 the code cleaner, and we would be able to hide the hack 
 StopFillCacheException by making it private to FieldCache (currently it's 
 public because NumericUtils is in o.a.l.util).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache

2009-06-19 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722060#action_12722060
 ] 

Earwin Burrfoot commented on LUCENE-1701:
-

bq. Someday maybe I'll convince you to donate this schema layer on top of 
Lucene
It's not generic enough to be of use for every user of Lucene, and it doesn't
aim to be. It also evolves, and donating something to Lucene means casting it
in concrete.
So that's not me being greedy or lazy (okay, maybe a little bit of the latter);
it's simply not public-quality code, as I understand the term.
I can share the design if anybody's interested, but it seems everyone is coping
with this on their own.

Solr has its own schema approach, and it has its merits and downfalls compared
to mine. That's what is nice: we're able to use the same library in different
ways, and it doesn't force its sense of 'best practices' on us.

bq. But I hope there are SOME named classes in there and not all static factory 
methods returning anonymous untyped impls.
SOME of them aren't static :-D

bq. We shouldn't weaken trie's integration to core just because others have 
private implementations.
You shouldn't integrate into core something that is not core functionality.
Think microkernels.
It's strange to see you driving CSFs, custom indexing chains, and pluggability
everywhere on one side, while on the other trying to add some weird custom
properties to the index that are tightly interwoven with only one of the
possible numeric implementations.

bq. Design for today.
And then spend two years deprecating and supporting today's designs after you
get a better thing tomorrow. Lucene-style back-compat and agile design aren't
something that marries well.

bq. What's important is that we don't weaken those private implementations with 
trie's addition, and I don't think our approach here has done that.
You're weakening Lucene itself by introducing too much coupling between its 
components.

The IndexReader/Writer pair is a good example of what I'm arguing against: a
dusty closet of microfeatures, tightly interwoven into a complex,
hard-to-maintain mess with zillions of (possibly broken) control paths -
remember the mutable deletes/norms + clone/reopen permutations? This could be
avoided if IR/W were kept to the bare minimum (which is what most people are
going to use), with the more advanced features built on top of it, not in the
same place.

NRT seems to tread the same path, and I'm not sure it's going to win that much
turnaround time now that per-segment collection has been introduced. Some time
ago I finished a first version of IR plugins, and I enjoy pretty low reopen
times (field/facet/filter cache warmups included). (Yes, I'm going to open an
issue for the plugins once they stabilize enough.)

{quote}
 If we add some generic storable flags for Lucene fields, this is cool 
 (probably), NumericField can then capitalize on it, as well as users writing 
 their own NNNFields.
+1 Wanna make a patch?
{quote}

No, I'd like to continue the IR cleanup and play with a positionIncrement
companion value that could enable true multiword synonyms.
I know, I know, it's a do-ocracy. But that's not an excuse for hacks.

 Add NumericField and NumericSortField, make plain text numeric parsers public 
 in FieldCache, move trie parsers to FieldCache
 

 Key: LUCENE-1701
 URL: https://issues.apache.org/jira/browse/LUCENE-1701
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index, Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: NumericField.java


 In discussions about LUCENE-1673, Mike and I wanted to add a new NumericField 
 to o.a.l.document, specifically for easy indexing. An alternative would be to add 
 a NumericUtils.newXxxField() factory that creates a preconfigured Field 
 instance with norms and tf off, optionally a stored text value (LUCENE-1699), and 
 the TokenStream already initialized. On the other hand, 
 NumericUtils.newXxxSortField could be moved to NumericSortField.
 Yonik and I tend to prefer the factory for both; Mike tends to create the new 
 classes.
 Also, the parsers for string-formatted numerics are not public in FieldCache. 
 As the new SortField API (LUCENE-1478) makes it possible to supply a parser 
 at SortField instantiation, it would be good to have the static parsers in 
 FieldCache publicly available. SortField would init its member variable to them 
 (instead of NULL), making the code a lot simpler (FieldComparator has these ugly 
 null checks when retrieving values from the cache).
 Moving the Trie parsers as static instances into FieldCache would also make 
 the code cleaner, and we would be able to hide the hack 
 StopFillCacheException by making it private to FieldCache (currently it's 
 public because NumericUtils is in o.a.l.util).

[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-06-17 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720619#action_12720619
 ] 

Earwin Burrfoot commented on LUCENE-1630:
-

I wasn't following the issue closely, so this question might be silly - how
does out-of-order scoring/collection marry with filters?
If I remember right, filter/scorer intersection relies on both iterators
advancing in doc-id order.
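
For reference, a sketch of the usual leapfrog intersection (the LeapFrog class
and nextMatch helper are made up for illustration; Scorer and DocIdSetIterator
are the real 2.9 classes). Both sides must advance in increasing doc-id order,
which is exactly what an out-of-order scorer breaks:

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Scorer;

final class LeapFrog {
  // Advances both iterators until they agree on a doc id (or run out).
  static int nextMatch(Scorer scorer, DocIdSetIterator filter)
      throws IOException {
    int s = scorer.nextDoc();
    int f = filter.advance(s);
    while (s != f) {
      if (s < f) {
        s = scorer.advance(f);   // relies on the scorer moving forward only
      } else {
        f = filter.advance(s);   // likewise for the filter
      }
    }
    return s; // NO_MORE_DOCS once either side is exhausted
  }
}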

 Mating Collector and Scorer on doc Id orderness
 ---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch


 This is a spin off of LUCENE-1593. This issue proposes to expose appropriate 
 API on Scorer and Collector such that one can create an optimized Collector 
 based on a given Scorer's doc-id orderness and vice versa. Copied from 
 LUCENE-1593, here is the list of changes:
 # Deprecate Weight and create QueryWeight (abstract class) with a new 
 scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) 
 method. QueryWeight implements Weight, while scorer(reader) calls 
 scorer(reader, false /* out-of-order */), and scorer(reader, scoreDocsInOrder) 
 is defined abstract.
 #* Also add QueryWeightWrapper to wrap a given Weight implementation. This 
 one will also be deprecated, as well as package-private.
 #* Add to Query variants of createWeight and weight which return QueryWeight. 
 For now, I prefer to add a default impl which wraps the Weight variant 
 instead of overriding in all Query extensions, and in 3.0 when we remove the 
 Weight variants - override in all extending classes.
 # Add to Scorer isOutOfOrder with a default to false, and override in BS to 
 true.
 # Modify BooleanWeight to extend QueryWeight and implement the new scorer 
 method to return BS2 or BS based on the number of required scorers and 
 setAllowOutOfOrder.
 # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns 
 true/false.
 #* Use it in IndexSearcher.search methods, that accept a Collector, in order 
 to create the appropriate Scorer, using the new QueryWeight.
 #* Provide a static create method to TFC and TSDC which accept this as an 
 argument and creates the proper instance.
 #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order 
 Scorer and check on the resulting Scorer isOutOfOrder(), so that we can 
 create the optimized Collector instance.
 # Modify IndexSearcher to use all of the above logic.
 The only class I'm worried about, and would like to verify with you, is 
 Searchable. If we want to deprecate all the search methods on IndexSearcher, 
 Searcher and Searchable which accept Weight and add new ones which accept 
 QueryWeight, we must do the following:
 * Deprecate Searchable in favor of Searcher.
 * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) 
 break back-compat and add them as abstract (like we've done with the new 
 Collector method) or (2) add them with a default impl to call the Weight 
 versions, documenting these will become abstract in 3.0.
 * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend 
 Searcher. That's the part I'm a little bit worried about - Searchable 
 implements java.rmi.Remote, which means there could be an implementation out 
 there which implements Searchable and extends something different than 
 UnicastRemoteObject, like Activeable. I think there is a very small chance this 
 has actually happened, but would like to confirm with you guys first.
 * Add a deprecated, package-private, SearchableWrapper which extends Searcher 
 and delegates all calls to the Searchable member.
 * Deprecate all uses of Searchable and add Searcher instead, defaulting the 
 old ones to use SearchableWrapper.
 * Make all the necessary changes to IndexSearcher, MultiSearcher etc. 
 regarding overriding these new methods.
 One other optimization that was discussed in LUCENE-1593 is to expose a 
 topScorer() API (on Weight) which returns a Scorer whose score(Collector) 
 will be called, and additionally add a start() method to DISI. That will 
 allow Scorers to initialize either on start() or score(Collector). This was 
 proposed mainly because of BS and BS2 which check if they are initialized in 
 every call to next(), skipTo() and score(). Personally I prefer to see that 
 in a separate issue, following that one (as it might add methods to 
 QueryWeight).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Re: madvise(ptr, len, MADV_SEQUENTIAL)

2009-06-16 Thread Earwin Burrfoot
Except, you don't know the size of the file to be written up front.
One probable solution is to map the output file in pages. As a complementary
solution you can map a huge area of the file and hope that little real memory
is allocated by the OS unless you actually write all over that area.
Dunno. The idea of using mmapped writes has stopped looking interesting to me.
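
For the record, a minimal sketch of the "map the output file in pages" idea
(PagedMmapWriter and the 16 MB window size are made up for illustration;
mapping READ_WRITE past the current end grows the file):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

final class PagedMmapWriter {
  private static final int PAGE = 1 << 24; // 16 MB window, arbitrary
  private final FileChannel channel;
  private MappedByteBuffer page;
  private long pageStart = 0;

  // The file must be opened "rw" so the mapping can extend it.
  PagedMmapWriter(RandomAccessFile file) throws IOException {
    channel = file.getChannel();
    page = channel.map(FileChannel.MapMode.READ_WRITE, 0, PAGE);
  }

  void writeByte(byte b) throws IOException {
    if (!page.hasRemaining()) {  // current window is full: map the next one
      page.force();
      pageStart += PAGE;
      page = channel.map(FileChannel.MapMode.READ_WRITE, pageStart, PAGE);
    }
    page.put(b);
  }
}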

On Tue, Jun 16, 2009 at 18:32, Uwe Schindler u...@thetaphi.de wrote:
 But to use it, we should change MMapDirectory to also use the mapping when
 writing to files. I thought about it; it is very simple to implement (just
 copy the IndexInput and change all gets() to sets())

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Tuesday, June 16, 2009 4:22 PM
 To: java-dev@lucene.apache.org
 Cc: Alan Bateman; nio-disc...@openjdk.java.net
 Subject: Re: madvise(ptr, len, MADV_SEQUENTIAL)

 Lucene could really make use of this method.  When a segment merge
 takes place, we can read & write many GB of data, which without
 madvise on many OSs would effectively flush the IO cache (thus hurting
 our search performance).

 Mike

 On Mon, Jun 15, 2009 at 6:01 PM, Jason
  Rutherglen jason.rutherg...@gmail.com wrote:
  Thanks Alan.
 
  I cross posted this to the Lucene dev list where we are discussing using
  madvise for minimizing unnecessary IO cache usage when merging segments
  (where we really want the newly merged segments in the IO cache rather
 than
  the old segment files).
 
  How would the advise method work?  Would there need to be a hint in the
  FileChannel.map method?
 
  -J
 
  On Mon, Jun 15, 2009 at 12:36 AM, Alan Bateman alan.bate...@sun.com
 wrote:
 
  Jason Rutherglen wrote:
 
  Is there going to be a way to do this in the new Java IO APIs?
 
  Good question, as it has come up a few times and is needed for some
  important use-cases. A while back I looked into adding a
   MappedByteBuffer#advise method to allow the application to provide hints
 on the
  expected usage but didn't complete it. We should probably look at this
 again
  for jdk7.
 
  -Alan.
 
 
 

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal for changing the backwards-compatibility policy

2009-06-16 Thread Earwin Burrfoot
Oh yes! Again!
+1

One point is missing: what about incompatible behavioral changes that
touch neither the API nor the file format?
Like posIncr=0 on the first token in a stream, or analyzer fixes, or
something along those lines.

Are we free to introduce them in a minor release without warning? Do we
warn one release before the change? Do we provide old-behaviour switches
that are deprecated from birth, or do we keep said switches around for a
couple of major releases?
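
To make the last option concrete, a hypothetical old-behaviour switch,
deprecated from the day it is added (the class and setting are made up, not
actual Lucene API):

public final class SomeTokenizerSettings {
  private boolean legacyFirstTokenPosIncr = false;

  /**
   * Restores the old behaviour of emitting posIncr=0 for the first token.
   * @deprecated exists only for back-compat; scheduled for removal in a
   * later major release
   */
  public void setLegacyFirstTokenPosIncr(boolean v) {
    legacyFirstTokenPosIncr = v;
  }

  public boolean getLegacyFirstTokenPosIncr() {
    return legacyFirstTokenPosIncr;
  }
}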


On Tue, Jun 16, 2009 at 14:37, Michael Busch busch...@gmail.com wrote:
 Probably everyone is thinking right now "Oh no! Not again!". I admit I
 didn't fully read the incredibly long recent thread about
 backwards-compatibility, so maybe what I'm about to propose has been
 proposed already. In that case my apologies in advance.

 Rather than discussing our current backwards-compatibility policy
 again, I'd like to make here a concrete proposal for changing the policy
 after Lucene 3.0 is released.

 I'll call X.Y -> X+1.0 a 'major release', X.Y -> X.Y+1 a
 'minor release' and X.Y.Z -> X.Y.Z+1 a 'bugfix release'. (we can later
 use different names; just for convenience here...)

 1. The file format backwards-compatiblity policy will remain unchanged;
   i.e. Lucene X.Y supports reading all indexes written with Lucene
   X-1.Y. That means Lucene 4.0 will not have to be able to read 2.x
   indexes.

 2. Deprecated public and protected APIs can be removed if they have
   been released in at least one major or minor release. E.g. a 3.1
   API can be released as deprecated in 3.2 and removed in 3.3 or 4.0
   (if 4.0 comes after 3.2).

 3. No public or protected APIs are changed in a bugfix release; except
   if a severe bug can't be fixed otherwise.

 4. Each release will have release notes with a new section
   Incompatible changes, which lists, as the names says, all changes that
   break backwards compatibility. The list should also have information
   about how to convert to the new API. I think the eclipse releases
   have such a release notes section.


 The big change here apparently is 2. Consider the current situation:
 We can release e.g. the new TokenStream API with 2.9, and then remove the
 old, deprecated API a month later in 3.0, while still complying with our
 current backwards-compatibility policy. A transition period of one month is
 very short for such an important API. On the other hand, a transition
 period of presumably 2 years, until 4.0 is released, seems very long
 to stick with a deprecated API that clutters the APIs and docs. With
 the proposed change, we couldn't do that. Given our current release
 schedule, the transition period would at least be 6-9 months, which
 seems a very reasonable timeframe.

 We should also not consider 2. as a must. I.e. we don't *have* to remove a
 deprecated API after only one major or minor release. For a very popular API
 like the TokenStream API, we could send a mail to java-user asking whether
 people need more transition time, and be flexible.

 I think this policy is much more dynamic and flexible, but should
 still give our users enough confidence. It also removes the need to
 do things just for the sake of the current policy rather than because
 they make the most sense, like our somewhat goofy X.9 releases. :)

 Just to make myself clear: I think we should definitely stick with our
 2.9 and 3.0 plans and change the policy afterwards.

 My +1 to all 4 points above.

 -Michael


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1673) Move TrieRange to core

2009-06-16 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720231#action_12720231
 ] 

Earwin Burrfoot commented on LUCENE-1673:
-

bq. This is that baking in a specific implementation into the index format that 
I don't like.
+many

bq. I do agree that retrieving a doc is already buggy, in that various things 
are lost from your index time doc (a well known issue at this point!)
How on earth is it buggy? You're working with an inverted index; you aren't
supposed to get the original document back from it in the first place. It's
like saying a hash function is buggy because it is not reversible.

The less coupling various Lucene components have on each other, the better. If
you'd like to have an end-to-end experience for numeric fields, build something
schema-like and put it in contribs. If it's hard to build, Lucene core is to
blame for not being extensible enough. From my experience, for that purpose
it's okay as it is.

 Move TrieRange to core
 --

 Key: LUCENE-1673
 URL: https://issues.apache.org/jira/browse/LUCENE-1673
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1673.patch, LUCENE-1673.patch, LUCENE-1673.patch


 TrieRange was iterated many times and seems stable now (LUCENE-1470, 
 LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to 
 its default FieldTypes (SOLR-940), and if possible I want to move it to core 
 before the release of 2.9.
 Before this can be done, there are some things to think about:
 # There are now classes called LongTrieRangeQuery, IntTrieRangeQuery; how 
 should they be called in core? I would suggest leaving it as it is. On the 
 other hand, if this remains our only numeric query implementation, we could 
 call it LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below, there 
 are problems). Same for the TokenStreams and Filters.
 # Maybe each pair of classes for indexing and searching should be merged into 
 one class: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The 
 problem here: ctors must be able to take int, long, double, float as range 
 parameters. For the end user, mixing these 4 types in one class is hard to 
 handle. If somebody forgets to add an L to a long, it suddenly instantiates 
 the int version of the range query, hitting no results, and so on (see the 
 sketch after this description). Same with other types. Maybe accept 
 java.lang.Number as parameter (because it is nullable, for half-open bounds) 
 and one enum for the type.
 # Should TrieUtils move into o.a.l.util, or o.a.l.document, or somewhere else?
 # Move the TokenStreams into o.a.l.analysis, and ShiftAttribute into 
 o.a.l.analysis.tokenattributes? Somewhere else?
 # If we rename the classes, should Solr stay with Trie (because there are 
 different impls)?
 # Maybe add a subclass of AbstractField that automatically creates these 
 TokenStreams and omits norms/tf by default, for easier addition to Document 
 instances?
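
To illustrate the overload trap from point 2: CombinedRangeQuery below is
hypothetical, standing in for the single-class design debated above (it is
not the API that shipped):

final class CombinedRangeQuery {
  CombinedRangeQuery(String field, int lo, int hi) { /* int-encoded range */ }
  CombinedRangeQuery(String field, long lo, long hi) { /* long-encoded range */ }

  public static void main(String[] args) {
    new CombinedRangeQuery("timestamp", 0L, 86400L); // long overload, as intended
    new CombinedRangeQuery("timestamp", 0, 86400);   // forgot the L: the int
                                                     // overload is chosen and
                                                     // matches nothing on a
                                                     // long-encoded field
  }
}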

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-06-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719539#action_12719539
 ] 

Earwin Burrfoot commented on LUCENE-1630:
-

I like the last option most. Creating a dummy scorer looks ugly to me, and it
seems likely to cause more problems of the same kind in the future.

 Mating Collector and Scorer on doc Id orderness
 ---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 This is a spin off of LUCENE-1593. This issue proposes to expose appropriate 
 API on Scorer and Collector such that one can create an optimized Collector 
 based on a given Scorer's doc-id orderness and vice versa. Copied from 
 LUCENE-1593, here is the list of changes:
 # Deprecate Weight and create QueryWeight (abstract class) with a new 
 scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) 
 method. QueryWeight implements Weight, while scorer(reader) calls 
 scorer(reader, false /* out-of-order */), and scorer(reader, scoreDocsInOrder) 
 is defined abstract.
 #* Also add QueryWeightWrapper to wrap a given Weight implementation. This 
 one will also be deprecated, as well as package-private.
 #* Add to Query variants of createWeight and weight which return QueryWeight. 
 For now, I prefer to add a default impl which wraps the Weight variant 
 instead of overriding in all Query extensions, and in 3.0 when we remove the 
 Weight variants - override in all extending classes.
 # Add to Scorer isOutOfOrder with a default to false, and override in BS to 
 true.
 # Modify BooleanWeight to extend QueryWeight and implement the new scorer 
 method to return BS2 or BS based on the number of required scorers and 
 setAllowOutOfOrder.
 # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns 
 true/false.
 #* Use it in IndexSearcher.search methods, that accept a Collector, in order 
 to create the appropriate Scorer, using the new QueryWeight.
 #* Provide a static create method to TFC and TSDC which accept this as an 
 argument and creates the proper instance.
 #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order 
 Scorer and check on the resulting Scorer isOutOfOrder(), so that we can 
 create the optimized Collector instance.
 # Modify IndexSearcher to use all of the above logic.
 The only class I'm worried about, and would like to verify with you, is 
 Searchable. If we want to deprecate all the search methods on IndexSearcher, 
 Searcher and Searchable which accept Weight and add new ones which accept 
 QueryWeight, we must do the following:
 * Deprecate Searchable in favor of Searcher.
 * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) 
 break back-compat and add them as abstract (like we've done with the new 
 Collector method) or (2) add them with a default impl to call the Weight 
 versions, documenting these will become abstract in 3.0.
 * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend 
 Searcher. That's the part I'm a little bit worried about - Searchable 
 implements java.rmi.Remote, which means there could be an implementation out 
 there which implements Searchable and extends something different than 
 UnicastRemoteObject, like Activeable. I think there is a very small chance this 
 has actually happened, but would like to confirm with you guys first.
 * Add a deprecated, package-private, SearchableWrapper which extends Searcher 
 and delegates all calls to the Searchable member.
 * Deprecate all uses of Searchable and add Searcher instead, defaulting the 
 old ones to use SearchableWrapper.
 * Make all the necessary changes to IndexSearcher, MultiSearcher etc. 
 regarding overriding these new methods.
 One other optimization that was discussed in LUCENE-1593 is to expose a 
 topScorer() API (on Weight) which returns a Scorer whose score(Collector) 
 will be called, and additionally add a start() method to DISI. That will 
 allow Scorers to initialize either on start() or score(Collector). This was 
 proposed mainly because of BS and BS2 which check if they are initialized in 
 every call to next(), skipTo() and score(). Personally I prefer to see that 
 in a separate issue, following that one (as it might add methods to 
 QueryWeight).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-06-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719539#action_12719539
 ] 

Earwin Burrfoot edited comment on LUCENE-1630 at 6/15/09 5:36 AM:
--

I like the last option (move scoresOutOfOrder to Weight) most. Creating a
dummy scorer looks ugly to me, and it seems likely to cause more problems of
the same kind in the future.


  was (Author: earwin):
I like the last option most. Creating a dummy scorer looks ugly to me, and it
seems likely to cause more problems of the same kind in the future.
  
 Mating Collector and Scorer on doc Id orderness
 ---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 This is a spin off of LUCENE-1593. This issue proposes to expose appropriate 
 API on Scorer and Collector such that one can create an optimized Collector 
 based on a given Scorer's doc-id orderness and vice versa. Copied from 
 LUCENE-1593, here is the list of changes:
 # Deprecate Weight and create QueryWeight (abstract class) with a new 
 scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) 
 method. QueryWeight implements Weight, while scorer(reader) calls 
 scorer(reader, false /* out-of-order */), and scorer(reader, scoreDocsInOrder) 
 is defined abstract.
 #* Also add QueryWeightWrapper to wrap a given Weight implementation. This 
 one will also be deprecated, as well as package-private.
 #* Add to Query variants of createWeight and weight which return QueryWeight. 
 For now, I prefer to add a default impl which wraps the Weight variant 
 instead of overriding in all Query extensions, and in 3.0 when we remove the 
 Weight variants - override in all extending classes.
 # Add to Scorer isOutOfOrder with a default to false, and override in BS to 
 true.
 # Modify BooleanWeight to extend QueryWeight and implement the new scorer 
 method to return BS2 or BS based on the number of required scorers and 
 setAllowOutOfOrder.
 # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns 
 true/false.
 #* Use it in IndexSearcher.search methods, that accept a Collector, in order 
 to create the appropriate Scorer, using the new QueryWeight.
 #* Provide a static create method to TFC and TSDC which accept this as an 
 argument and creates the proper instance.
 #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order 
 Scorer and check on the resulting Scorer isOutOfOrder(), so that we can 
 create the optimized Collector instance.
 # Modify IndexSearcher to use all of the above logic.
 The only class I'm worried about, and would like to verify with you, is 
 Searchable. If we want to deprecate all the search methods on IndexSearcher, 
 Searcher and Searchable which accept Weight and add new ones which accept 
 QueryWeight, we must do the following:
 * Deprecate Searchable in favor of Searcher.
 * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) 
 break back-compat and add them as abstract (like we've done with the new 
 Collector method) or (2) add them with a default impl to call the Weight 
 versions, documenting these will become abstract in 3.0.
 * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend 
 Searcher. That's the part I'm a little bit worried about - Searchable 
 implements java.rmi.Remote, which means there could be an implementation out 
 there which implements Searchable and extends something different than 
 UnicastRemoteObject, like Activeable. I think there is a very small chance this 
 has actually happened, but would like to confirm with you guys first.
 * Add a deprecated, package-private, SearchableWrapper which extends Searcher 
 and delegates all calls to the Searchable member.
 * Deprecate all uses of Searchable and add Searcher instead, defaulting the 
 old ones to use SearchableWrapper.
 * Make all the necessary changes to IndexSearcher, MultiSearcher etc. 
 regarding overriding these new methods.
 One other optimization that was discussed in LUCENE-1593 is to expose a 
 topScorer() API (on Weight) which returns a Scorer whose score(Collector) 
 will be called, and additionally add a start() method to DISI. That will 
 allow Scorers to initialize either on start() or score(Collector). This was 
 proposed mainly because of BS and BS2 which check if they are initialized in 
 every call to next(), skipTo() and score(). Personally I prefer to see that 
 in a separate issue, following that one (as it might add methods to 
 QueryWeight).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: Payloads and TrieRangeQuery

2009-06-14 Thread Earwin Burrfoot
 Just to throw something out, the new Token API is not very consumable in my
 experience. The old one was very intuitive and the code was very easy to follow.

 I've had to refigure out what the heck was going on with the new one more
 than once now. Writing some example code with it is hard to follow or
 justify to a new user.

 What was the big improvement with it again? Advanced, expert custom indexing
 chains require less casting or something, right?

 I dunno - anyone else have any thoughts now that the new API has been in
 circulation for some time?
I have an advanced, expert custom indexing chain, and it's still not
ported over to the new API.
It's counter-intuitive all right, with names not really saying what's
going on (please, for an AttributeSource, whose Attribute is it?
An Attribute is a quality of 'something', but that 'something' is amiss),
but the biggest problem for me is that it capitalizes on the idea of a
token stream even further, making filters whose output is several
times the input token-wise, or which need to inspect a number of tokens
before emitting something, much harder to write. I most probably
missed something and there IS a way not to trash your memory with
non-reused LinkedHashMaps, but then again, there are no pointers.
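
For what it's worth, a sketch of a one-to-many filter on the 2.9 attribute
API (DuplicateFilter is made up; it emits a stacked copy of every token,
reusing a single captured State rather than allocating per token):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

final class DuplicateFilter extends TokenFilter {
  private final TermAttribute term =
      (TermAttribute) addAttribute(TermAttribute.class);
  private final PositionIncrementAttribute posIncr =
      (PositionIncrementAttribute) addAttribute(PositionIncrementAttribute.class);
  private State pending; // state captured from the last real token

  DuplicateFilter(TokenStream in) { super(in); }

  public boolean incrementToken() throws IOException {
    if (pending != null) {       // first emit the stacked copy
      restoreState(pending);
      pending = null;
      posIncr.setPositionIncrement(0);
      term.setTermBuffer("copy-of-" + term.term());
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    pending = captureState();    // remember it for the extra emission
    return true;
  }
}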

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

2009-06-14 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719322#action_12719322
 ] 

Earwin Burrfoot commented on LUCENE-1488:
-

bq. But this can't replace ArabicAnalyzer completely, because ArabicAnalyzer
stems Arabic text in a language-specific way, which has a huge effect on
retrieval quality for Arabic-language text.
What about separating word tokenization from morphological processing?
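
A sketch of that split (ComposedArabicAnalyzer is made up; WhitespaceTokenizer
is just a stand-in for any Unicode-aware word tokenizer, e.g. the
BreakIterator-based one from the attached patch; the two Arabic filters are
the existing contrib classes):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.ar.ArabicNormalizationFilter;
import org.apache.lucene.analysis.ar.ArabicStemFilter;

final class ComposedArabicAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // word-breaking only; swap in any Unicode-aware tokenizer here
    TokenStream ts = new WhitespaceTokenizer(reader);
    ts = new ArabicNormalizationFilter(ts); // orthographic normalization
    ts = new ArabicStemFilter(ts);          // language-specific stemming
    return ts;
  }
}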

 issues with standardanalyzer on multilingual text
 -

 Key: LUCENE-1488
 URL: https://issues.apache.org/jira/browse/LUCENE-1488
 Project: Lucene - Java
  Issue Type: Wish
  Components: contrib/analyzers
Reporter: Robert Muir
Priority: Minor
 Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.txt


 The standard analyzer in Lucene is not exactly Unicode-friendly with regard 
 to breaking text into words, especially with respect to non-alphabetic 
 scripts. This is because it is unaware of the Unicode bounds properties.
 I actually couldn't figure out how the Thai analyzer could possibly be 
 working until I looked at the jflex rules and saw that the codepoint range 
 for most of the Thai block was added to the alphanum specification. Defining 
 the exact codepoint ranges like this for every language could help with the 
 problem, but you'd basically be reimplementing the bounds properties already 
 stated in the Unicode standard. 
 In general it looks like this kind of behavior is bad in Lucene even for 
 Latin; for instance, the analyzer will break words around accent marks in 
 decomposed form. While most Latin letter + accent combinations have composed 
 forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, 
 I suppose.) 
 I've got a partially tested StandardAnalyzer variant that uses the ICU 
 rule-based BreakIterator instead of jflex. Using this method you can define 
 word boundaries according to the Unicode bounds properties. After getting it 
 into some good shape I'd be happy to contribute it for contrib, but I wonder 
 if there's a better solution so that out-of-the-box Lucene will be more 
 friendly to non-ASCII text. Unfortunately it seems jflex does not support the 
 use of these properties, such as [\p{Word_Break = Extend}], so this is 
 probably the major barrier.
 Thanks,
 Robert

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-10 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718009#action_12718009
 ] 

Earwin Burrfoot commented on LUCENE-1453:
-

bq. As the Filter is just a deprecated wrapper, that is removed in 3.0, I think 
reusing SegmentReader.Ref for that is ok. 
Ok. Maybe you are right.

bq. Closeable is a Java 1.5 interface only, so this refactoring must wait until 
3.0, but the idea is good!
We can introduce our own Closeable and replace it with the Java-native one in
3.0; thank gods the interface is simple :)
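
A minimal sketch of such a home-grown interface (same shape as
java.io.Closeable, so 3.0 can swap it out mechanically):

public interface Closeable {
  void close() throws java.io.IOException;
}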

 When reopen returns a new IndexReader, both IndexReaders may now control the 
 lifecycle of the underlying Directory which is managed by reference counting
 -

 Key: LUCENE-1453
 URL: https://issues.apache.org/jira/browse/LUCENE-1453
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
Reporter: Mark Miller
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.4.1, 2.9

 Attachments: Failing-testcase-LUCENE-1453.patch, 
 LUCENE-1453-with-FSDir-open.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
 LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch


 Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
 when IndexReader.reopen shares a Directory with a created IndexReader and 
 closeDirectory is true, FSDirectory's ref management will see two decrements 
 for one increment. You can end up getting an AlreadyClosedException on the 
 Directory when the IndexReader is open.
 I have a test I'll put up. A solution seems fairly straightforward (at least 
 in what needs to be accomplished).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Payloads and TrieRangeQuery

2009-06-10 Thread Earwin Burrfoot
 And this information about the trie
 structure and where payloads are should be stored in FieldInfos.

 As is the case today, the info is encoded in the class you use (and
 it's settings)... no need to add it to the index structure.  In any
 case, it's a completely different issue and shouldn't be tied to
 TrieRange improvements.

 The problem is, because the details of Trie* at index time affect
 what's in each segment, this information needs to be stored per
 segment.

And then, when you merge segments indexed with different Trie*
settings, you need to convert them to some common form.
Sounds like something too complex, with minimal returns.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1607) String.intern() faster alternative

2009-06-10 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718198#action_12718198
 ] 

Earwin Burrfoot commented on LUCENE-1607:
-

bq. but I was waiting for some kind of feedback if people in general thought it 
was the right approach. It introduces another static, and people tend to not 
like that.
I just somehow forgot about this issue.
You're right about the static; it's not clear how and when to initialize it,
plus we'd introduce some public classes that we'll be unable to change or
remove later.
I still have a feeling we should expose a single static method - intern() -
and hide the implementation away, possibly tuning it to be advantageous for
thousands of fields and degrading to raw String.intern() performance if there
are more fields.
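
Roughly the shape such a hidden implementation could take (a sketch; the
actual patch may differ): a small lossy cache in front of String.intern() -

public final class StringInterner {
  private static final String[] cache = new String[1024]; // power of two

  public static String intern(String s) {
    int slot = s.hashCode() & (cache.length - 1);
    String cached = cache[slot];
    if (cached != null && cached.equals(s)) {
      return cached;      // fast path: no call into the VM's intern pool
    }
    s = s.intern();       // slow path: degrade to raw String.intern()
    cache[slot] = s;      // benign race: worst case is a lost cache entry
    return s;
  }
}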

I'm going to be away from AC power for three days starting now, so I won't be 
able to reply until then.

 String.intern() faster alternative
 --

 Key: LUCENE-1607
 URL: https://issues.apache.org/jira/browse/LUCENE-1607
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Earwin Burrfoot
 Fix For: 2.9

 Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
 LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
 LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch


 By using our own interned-string pool on top of the default one, 
 String.intern() can be greatly optimized.
 On my setup (Java 6) this alternative runs ~15.8x faster for already interned 
 strings, and ~2.2x faster for 'new String(interned)'.
 For Java 5 and 4 the speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Payloads and TrieRangeQuery

2009-06-10 Thread Earwin Burrfoot
  * Was the field even indexed w/ Trie, or indexed as simple text?
    It's useful to know this automatically at search time, so eg a
    RangeQuery can do the right thing by default.  FieldInfos seems
    like the natural place to store this.  It's basically Lucene's
    per-segment write-once schema.  Eg we use this to record did any
    token in this field have a Payload?, which is analogous.
This should really be in a schema of some kind (as in my project, for
instance).
Why do you do autodetection for tries, but recently removed it for FieldCache?
Things should be consistent: either store all settings in the index (and die
in the process), or don't store them there at all.

  * We have a bug (or an important improvement) in how Trie encodes
    terms that we need to fix.  This one is not easy to handle, since
    such a change could alter the term order, and merging segments
    then becomes problematic.  Not sure how to handle that.  Yonik,
    has Solr ever had to make a change to NumberUtils?
There are cases when reindexing is inevitable. What's so horrible about it
anyway? Even if you have a humongous index, you can rebuild it in a matter of
days, and you don't do this often.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717657#action_12717657
 ] 

Earwin Burrfoot commented on LUCENE-1453:
-

Patch looks fine. I read the last one, LUCENE-1453-with-FSDir-open.patch.

 When reopen returns a new IndexReader, both IndexReaders may now control the 
 lifecycle of the underlying Directory which is managed by reference counting
 -

 Key: LUCENE-1453
 URL: https://issues.apache.org/jira/browse/LUCENE-1453
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
Reporter: Mark Miller
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.4.1, 2.9

 Attachments: Failing-testcase-LUCENE-1453.patch, 
 LUCENE-1453-with-FSDir-open.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
 LUCENE-1453.patch, LUCENE-1453.patch


 Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
 when IndexReader.reopen shares a Directory with a created IndexReader and 
 closeDirectory is true, FSDirectory's ref management will see two decrements 
 for one increment. You can end up getting an AlreadyClosedException on the 
 Directory when the IndexReader is open.
 I have a test I'll put up. A solution seems fairly straightforward (at least 
 in what needs to be accomplished).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Some thoughts around the use of reader.isDeleted and hasDeletions

2009-06-09 Thread Earwin Burrfoot
 Actually: I think we should also change IndexReader.document to not
 check if it's deleted?  (Renaming it to something like rawDocument(),
 storedDocument(), something, in the process, and deprecating the old
 one).
Yup. After all, the most common use-case is to load a document after finding
it in one way or another. It's pretty hard to come up with the id of a deleted
document.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717769#action_12717769
 ] 

Earwin Burrfoot commented on LUCENE-1453:
-

bq. I think it should (be closed in a finally clause).

Then there's the next question of the same sort, though it probably belongs in
a separate issue. If we close a DR and one of its SRs throws an exception,
should we close the others (currently we don't)? What is the right way, in
general, of handling IOExceptions on IR close? Can we retry the close? What
does this exception mean?

 When reopen returns a new IndexReader, both IndexReaders may now control the 
 lifecycle of the underlying Directory which is managed by reference counting
 -

 Key: LUCENE-1453
 URL: https://issues.apache.org/jira/browse/LUCENE-1453
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
Reporter: Mark Miller
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.4.1, 2.9

 Attachments: Failing-testcase-LUCENE-1453.patch, 
 LUCENE-1453-with-FSDir-open.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
 LUCENE-1453.patch, LUCENE-1453.patch


 Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
 when IndexReader.reopen shares a Directory with a created IndexReader and 
 closeDirectory is true, FSDirectory's ref management will see two decrements 
 for one increment. You can end up getting an AlreadyClosedException on the 
 Directory when the IndexReader is open.
 I have a test I'll put up. A solution seems fairly straightforward (at least 
 in what needs to be accomplished).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717823#action_12717823
 ] 

Earwin Burrfoot commented on LUCENE-1678:
-

I second this. Though I've lost any hope for sane Lucene release/compat rules.

 Deprecate Analyzer.tokenStream
 --

 Key: LUCENE-1678
 URL: https://issues.apache.org/jira/browse/LUCENE-1678
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9


 The addition of reusableTokenStream to the core analyzers unfortunately broke 
 back compat of external subclasses:
 
 http://www.nabble.com/Extending-StandardAnalyzer-considered-harmful-td23863822.html
 On upgrading, such subclasses would silently not be used anymore, since 
 Lucene's indexing invokes reusableTokenStream.
 I think we should at least deprecate Analyzer.tokenStream today, so 
 that users see deprecation warnings if their classes override this method. 
 But going forward, when we want to change the API of core classes that are 
 extended, I think we have to introduce entirely new classes to keep back 
 compatibility.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717862#action_12717862
 ] 

Earwin Burrfoot commented on LUCENE-1678:
-

bq. If there are sane/smart ways to change our back compat policy, I think you 
have seen that no one would object.
It's not a matter of finding a smart way. It is a matter of the sacrifice that
has to be made, and of the readiness to take the blame for a decision that may
be unpopular with someone.
If you go zealously for back-compat, you sacrifice the
readability/maintainability of your code, but you free users from any troubles
when they want to 'simply upgrade'. If you adopt a more relaxed policy, you
sacrifice users' time, but in return you gain a cleaner codebase, and new
stuff can be written and used faster.
There's no way to ride two horses at once.

Some people are comfortable with the current policies. A few cringe when they
hear things like the above. Most theoretically want to relax the rules.
Nobody's ready to give up something for it.

Okay, there's an escape hatch I (and someone else) mentioned on the list
before: adopting a fixed release cycle with small intervals between releases
(compared to what we have now). Fixed - as in, releases are made every N months
instead of when everyone feels they have finished and polished up all their pet
projects and there's nothing else exciting to do. That way we can keep the
current policy, but the deletion-through-deprecation approach will work, at
last!
This solution is half-assed. I can already see discussions like "That was a
big change, let's keep the deprecates around longer, say - for a couple of
releases." It doesn't solve the good-name-thrashing problem either, as you
have to go through two rounds of deprecation to change the semantics of
something while keeping the name.
But this is something better than what we have now, a-a-and this is something
that needs committer backing.

bq. Thats a great indication to me that the issue is not simple.
The issue is simple; the choice is not. And maintaining the status quo is free.

bq. Giving up is really not the answer though
It is the answer. I have no moral right to hammer my ideals into heads that
did tremendously more for the project than I did. And maintaining a patch
queue on top of Lucene trunk is not 'that' hard.


 Deprecate Analyzer.tokenStream
 --

 Key: LUCENE-1678
 URL: https://issues.apache.org/jira/browse/LUCENE-1678
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9


 The addition of reusableTokenStream to the core analyzers unfortunately broke 
 back compat of external subclasses:
 
 http://www.nabble.com/Extending-StandardAnalyzer-considered-harmful-td23863822.html
 On upgrading, such subclasses would silently not be used anymore, since 
 Lucene's indexing invokes reusableTokenStream.
 I think we should should at least deprecate Analyzer.tokenStream, today, so 
 that users see deprecation warnings if their classes override this method.  
 But going forward when we want to change the API of core classes that are 
 extended, I think we have to  introduce entirely new classes, to keep back 
 compatibility.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717866#action_12717866
 ] 

Earwin Burrfoot commented on LUCENE-1453:
-

Two suggestions:

Factor out a RefCount class and use it everywhere throughout Lucene. I see at
least one identical to yours in SegmentReader. That would make it easier to
replace all these uses with AtomicInteger later.

Looking at the new unsightly loop in doClose(): what if we change all Lucene
closeable classes to implement java.io.Closeable and create a static utility
method(-s) that receives a bunch of Closeables (an array, iterable, vararg in
1.5) and tries to close them all?
The method should be null-safe (so you can skip != null checks) and should
handle/rethrow exceptions. The most proper way to handle exceptions is
probably this: rethrow the original exception if it is the only one (be it
Runtime or IO); if there is more than one, gather all the exceptions and wrap
them in a special IOException subclass that concatenates their messages and
keeps them around, so they are inspectable at debug time or when you implement
special treatment for that exception in your code.
This method can be reused in a heap of places later; SR.doClose() comes first
to mind.

I can do the latter one in a separate patch, to close this issue faster.
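
A sketch of that utility under the stated constraints (closeSafely is a
made-up name; the multi-exception wrapper is elided and only the first
exception is rethrown here):

import java.io.Closeable;
import java.io.IOException;

final class CloseUtil {
  static void closeSafely(Closeable... objects) throws IOException {
    IOException first = null;
    for (Closeable c : objects) {
      if (c == null) continue;        // null-safe: callers can skip != null checks
      try {
        c.close();
      } catch (IOException e) {
        if (first == null) first = e; // keep closing the rest, remember the first failure
      }
    }
    if (first != null) throw first;
  }
}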

 When reopen returns a new IndexReader, both IndexReaders may now control the 
 lifecycle of the underlying Directory which is managed by reference counting
 -

 Key: LUCENE-1453
 URL: https://issues.apache.org/jira/browse/LUCENE-1453
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
Reporter: Mark Miller
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.4.1, 2.9

 Attachments: Failing-testcase-LUCENE-1453.patch, 
 LUCENE-1453-with-FSDir-open.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
 LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch


 Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
 when IndexReader.reopen shares a Directory with a created IndexReader and 
 closeDirectory is true, FSDirectory's ref management will see two decrements 
 for one increment. You can end up getting an AlreadyClosed exception on the 
 Directory when the IndexReader is open.
 I have a test I'll put up. A solution seems fairly straightforward (at least 
 in what needs to be accomplished).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-09 Thread Earwin Burrfoot
@Mark:
 Okay, there's an escape hatch I (and someone else) mentioned on the list
 before. Adopting a fixed release cycle with small intervals between releases
 (compared to what we have now). Fixed - as in, releases are made each N
 months instead of when everyone feels they finished and polished up all
 their pet projects and there's nothing else exciting to do. That way we can
 keep the current policy, but the deletion-through-deprecation approach will
 work, at last!
 That's a big change. I think it's a nice idea, but I don't know how practical
 it is. Most of us are basically volunteering time for this type of thing.
 Even still, with the pace of development lately (and you can be sure that
 the current pace is a *new* thing, Lucene did not always have this amount of
 activity), it might make sense.
You're missing the most important point. Fixed schedule means that the
only reason not to do a release is the total absence of changes.
No matter how much or how few changes are released each time, fixed
schedule gives you predictable lifecycle for all your
deprecation/back-compat needs.

 But that idea needs a champion, and frankly
 I don't have the time right now (it wouldn't likely be in my realm anyway).
 And that's probably the deal with most others. They have work and/or other
 itches that are higher priority than championing a big change.
And here we got at one of the roots of the problem. The root that is
going to stay.

 bq. Giving up is really not the answer though
 It is the answer. I have no moral right to hammer my ideals into heads
 that did tremendously more for the project than I did. And maintaining a
 patch queue over Lucene trunk is not 'that' hard.
 It's not about hammering your ideals - that almost feels like what you are
 doing, but frankly, it doesn't help. If you even just keep prompting the
 issue as it dies away you will likely keep progress going. There is a
 solution that everyone will accept. I promise you that. It's more work than
 it looks to find that solution and guide it to fruition though. It's fully
 possible, and I'm sure it will happen eventually. Would have bet even money
 that Mike had it a few weeks ago. No dice it looks though ;)
I consciously took a bit of an extremist stance in the hope of shifting the
mean. Okay, I will try ditching it in favour of gently bugging people
like Grant did in the comment that spawned this discussion. :)

@Yonik:
 You go zealously for back-compat - you sacrifice the readability/maintainability 
 of your code but free users from any troubles when they want to 'simply 
 upgrade'. You adopt a more relaxed policy - you sacrifice users' time, but in 
 return you gain a cleaner codebase, and new stuff can be written and used 
 faster.
 Not sure I agree with that - if changes become too easy you can get a
 thrashing effect... change just because someone thought it was a
 little better can lead to more chaos.
You're right.
I'm not advocating anarchy. :) But currently we are afraid to break
anything at all, and that is as far away from juste milieu as the
chaos you speak of.

 IMO, changes to interfaces should be clearly better than what existed before.
Recent changes to DISI? Were they clearly for the better?

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes

2009-06-07 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717089#action_12717089
 ] 

Earwin Burrfoot commented on LUCENE-1648:
-

As LUCENE-1651 is now committed, this issue can be resolved.

 when you clone or reopen an IndexReader with pending changes, the new reader 
 doesn't commit the changes
 ---

 Key: LUCENE-1648
 URL: https://issues.apache.org/jira/browse/LUCENE-1648
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1648-followup.patch, LUCENE-1648-followup.patch, 
 LUCENE-1648.patch


 While working on LUCENE-1647, I came across this issue... we are failing to 
 carry over hasChanges, norms/deletionsDirty, etc, when cloning the new reader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-07 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717107#action_12717107
 ] 

Earwin Burrfoot commented on LUCENE-1453:
-

bq. There are two possibilities to fix this:
Vote for leaving them open. Yes, it breaches the contract, but the breach is 
controlled (and thus harmless), and we get rid of some weird code (= a possible 
point of failure) without introducing new code.
There is a way to notice the change in DirectoryReader behaviour, but it is too 
unrealistic:
{code}
IndexReader r = IndexReader.open("/path/to/index");
...
Directory d = r.directory(); // you have to get the directory reference, as you're not the one who created it
...
r.close();
...
d.doSomething(); // and EXPECT this call to fail with an exception
{code}

 When reopen returns a new IndexReader, both IndexReaders may now control the 
 lifecycle of the underlying Directory which is managed by reference counting
 -

 Key: LUCENE-1453
 URL: https://issues.apache.org/jira/browse/LUCENE-1453
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
Reporter: Mark Miller
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.4.1, 2.9

 Attachments: Failing-testcase-LUCENE-1453.patch, LUCENE-1453.patch, 
 LUCENE-1453.patch, LUCENE-1453.patch


 Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
 when IndexReader.reopen shares a Directory with a created IndexReader and 
 closeDirectory is true, FSDirectory's ref management will see two decrements 
 for one increment. You can end up getting an AlreadyClosed exception on the 
 Directory when the IndexReader is open.
 I have a test I'll put up. A solution seems fairly straightforward (at least 
 in what needs to be accomplished).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (SOLR-706) Fast auto-complete suggestions

2009-06-07 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717108#action_12717108
 ] 

Earwin Burrfoot commented on SOLR-706:
--

When I did autocompletion for my project, a simple java.util.TreeMap had superior 
memory characteristics and almost the same performance as tries. I think it's 
not worth inventing something elaborate for this task.
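
For illustration, a rough sketch of the TreeMap approach (class and method names 
are mine, not from any project): the half-open subMap range [prefix, prefix + 
'\uffff') yields exactly the terms starting with the prefix, already sorted.

{code}
import java.util.SortedMap;
import java.util.TreeMap;

public class PrefixSuggester {
  // term -> frequency (or any other payload); TreeMap keeps keys sorted
  private final TreeMap<String, Integer> terms = new TreeMap<String, Integer>();

  public void add(String term, int freq) {
    terms.put(term, freq);
  }

  // All terms starting with the given prefix, in lexicographic order.
  public SortedMap<String, Integer> suggest(String prefix) {
    return terms.subMap(prefix, prefix + '\uffff');
  }
}
{code}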

 Fast auto-complete suggestions
 --

 Key: SOLR-706
 URL: https://issues.apache.org/jira/browse/SOLR-706
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
 Fix For: 1.5


 A lot of users have suggested that facet.prefix in Solr is not the most 
 efficient way to implement an auto-complete suggestion feature. A fast 
 in-memory trie like structure has often been suggested instead. This issue 
 aims to incorporate a faster/efficient way to answer auto-complete queries in 
 Solr.
 Refer to the following discussion on solr-dev -- 
 http://markmail.org/message/sjjojrnroo3msugj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-236) Field collapsing

2009-06-07 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717110#action_12717110
 ] 

Earwin Burrfoot commented on SOLR-236:
--

I have implemented collapsing on a high-volume project of mine in a much less 
flexible, but more practical manner.

Part I. You have to guarantee that all documents having the same value of the 
collapse-field are dropped into the Lucene index as a sequential batch. That 
guarantees they get sequential docIds, and with some more work - that they all 
end up in the same segment.
Part II. When doing collection you always get docIds in sequential order, and 
thus, thanks to Part I, you get the docs-to-be-collapsed already grouped by 
collapse-field, even before you drop the docs into the PriorityQueue to sort them.

Cons:
You can only collapse on a single field, predetermined at index-creation time.
If one document changes, you have to reindex all docs that have the same 
collapse-field value, so it's best if you have either low update/add rates, or 
few documents sharing the same collapse-field value.

Pros:
The CPU and memory costs for collapsing compared to usual search are very close 
to zero and do not depend on index size/total docs found.
The same idea works with new Lucene per-segment collection and in distributed 
mode (sharded index).
Within collapsed group you can sort hits however you want, and select one that 
will represent the group for usual sort/paging.
The implementation is not brain-dead simple, but nears it.
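
To make the collecting side concrete, a rough sketch against the Lucene 2.9 
Collector API (the class and the group-handling hooks are invented for 
illustration, and it assumes every doc has a collapse value):

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;

public class CollapsingCollector extends Collector {
  private final String collapseField;
  private String[] keys;       // collapse key per docId, from FieldCache
  private Scorer scorer;
  private String currentKey;

  public CollapsingCollector(String collapseField) {
    this.collapseField = collapseField;
  }

  public void setScorer(Scorer scorer) { this.scorer = scorer; }

  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    keys = FieldCache.DEFAULT.getStrings(reader, collapseField);
  }

  public void collect(int doc) throws IOException {
    String key = keys[doc];
    if (!key.equals(currentKey)) {  // docIds arrive in order, so a key change
      currentKey = key;             // marks the start of a new collapse group
      startNewGroup(key);
    }
    addToCurrentGroup(doc, scorer.score());
  }

  public boolean acceptsDocsOutOfOrder() {
    return false;  // the whole trick relies on in-order docIds (Part II above)
  }

  private void startNewGroup(String key) { /* flush previous group's best hit */ }
  private void addToCurrentGroup(int doc, float score) { /* pick the representative */ }
}
{code}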

 Field collapsing
 

 Key: SOLR-236
 URL: https://issues.apache.org/jira/browse/SOLR-236
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Emmanuel Keller
 Fix For: 1.5

 Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
 collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
 collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-solr-236-2.patch, 
 field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, 
 field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, 
 field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
 field_collapsing_dsteigerwald.diff, SOLR-236-FieldCollapsing.patch, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
 solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch


 This patch includes a new feature called Field collapsing.
 Used in order to collapse a group of results with a similar value for a given 
 field to a single entry in the result set. Site collapsing is a special case 
 of this, where all results for a given web site are collapsed into one or two 
 entries in the result set, typically with an associated more documents from 
 this site link. See also Duplicate detection.
 http://www.fastsearch.com/glossary.aspx?m=48&amid=299
 The implementation adds 3 new query parameters (SolrParams):
 collapse.field to choose the field used to group results
 collapse.type normal (default value) or adjacent
 collapse.max to select how many continuous results are allowed before 
 collapsing
 TODO (in progress):
 - More documentation (on source code)
 - Test cases
 Two patches:
 - field_collapsing.patch for current development version
 - field_collapsing_1.1.0.patch for Solr-1.1.0
 P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: IR static methods

2009-06-04 Thread Earwin Burrfoot
Index/Commit/SegmentMetadata? Several classes, as you can reflect on
various levels of the index.

Some offtopic - SegmentInfo/SegmentInfos should really be named
Segment/Segments. That's exactly what these objects represent.
You don't use names like PreparedStatementInfo or FileInfo or IntegerInfo :)

On Fri, Jun 5, 2009 at 02:21, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:

 We have:
 $ ff \*Info\*java
 ./src/java/org/apache/lucene/index/FieldInfo.java
 ./src/java/org/apache/lucene/index/TermVectorOffsetInfo.java
 ./src/java/org/apache/lucene/index/SegmentInfo.java
 ./src/java/org/apache/lucene/index/TermInfosWriter.java
 ./src/java/org/apache/lucene/index/TermInfo.java
 ./src/java/org/apache/lucene/index/FieldInfos.java
 ./src/java/org/apache/lucene/index/SegmentMergeInfo.java
 ./src/java/org/apache/lucene/index/TermInfosReader.java
 ./src/java/org/apache/lucene/index/SegmentInfos.java

 How about IndexInfo?

  Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: Earwin Burrfoot ear...@gmail.com
 To: java-dev@lucene.apache.org
 Sent: Wednesday, June 3, 2009 8:08:50 AM
 Subject: IR static methods

 I have a strong desire to remove all these static methods from IR -
 lastModified, getCurrentVersion, getCommitUserData, indexExists.
 But haven't found a good place for them yet.

 Directory - is a bad place, it shouldn't concern itself with details
 of what exactly is stored inside, it should think of 'how' it is
 stored.
 IndexReader - is bad, it is too heavyweight to be created for getting
 something simple once.

 We should probably create some new lightweight class that provides a
 kind of reflection for the index? Mod dates, versions, userdata,
 existence, sizes, deletions, whatever. Both per-index and per-segment.
 Essentially it is a wrapper over SegmentInfos that allows us to keep
 them hidden (and thus easily changeable), and provides users with a more
 concise and adequate interface.

 Any thoughts?

 --
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.

2009-06-03 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715836#action_12715836
 ] 

Earwin Burrfoot commented on LUCENE-1651:
-

Seems the patch didn't apply completely. Your line numbers are off; also, 
directory/readOnly are now members of SegmentReader, so there's no way they can't 
be seen:

{code}
class SegmentReader extends IndexReader implements Cloneable {
  protected Directory directory;
  protected boolean readOnly;

  private String segment;
  private SegmentInfo si;
  private int readBufferSize;
{code}

Here's the corresponding part of the patch; I bet $Id$ is the reason.
{code}
-/**
- * @version $Id$
- */
-class SegmentReader extends DirectoryIndexReader {
+/** @version $Id$ */
+class SegmentReader extends IndexReader implements Cloneable {
+  protected Directory directory;
+  protected boolean readOnly;
+
{code}

 Make IndexReader.open() always return MSR to simplify (re-)opens.
 -

 Key: LUCENE-1651
 URL: https://issues.apache.org/jira/browse/LUCENE-1651
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.9
Reporter: Earwin Burrfoot
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1651-tag.patch, LUCENE-1651.patch, 
 LUCENE-1651.patch


 As per discussion in mailing list, I'm making DirectoryIndexReader.open() 
 always return MSR, even for single-segment indexes.
 While theoretically valid in the past (if you make sure to keep your index 
 constantly optimized) this feature is made practically obsolete by 
 per-segment collection.
 The patch somewhat de-hairies (re-)open logic for MSR/SR.
 SR no longer needs an ability to pose as toplevel directory-owning IR.
 All related logic is moved from DIR to MSR.
 DIR becomes almost empty, and after copying the two or three remaining fields over to 
 MSR/SR, I remove it.
 Lots of tests fail, as they rely on the SR returned from IR.open(); I fix this by 
 introducing a static package-private SR.getOnlySegmentReader method.
 Some previous bugs are uncovered, one is fixed in LUCENE-1645, another 
 (partially fixed in LUCENE-1648) is fixed in this patch. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.

2009-06-03 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715908#action_12715908
 ] 

Earwin Burrfoot commented on LUCENE-1651:
-

bq. Patch looks good Earwin, thanks!
I believe the readers can be cleaned up further, but I'm short on time and 
don't want to delay it for another week or two, and then rebase it against 
updated trunk once again. Might as well do that under a separate issue.

bq. I think we should now rename MultiSegmentReader to DirectoryIndexReader?
Maybe DirectoryReader instead of DirectoryIndexReader? But all three are in 
fact okay with me; I really don't have any preference here.


 Make IndexReader.open() always return MSR to simplify (re-)opens.
 -

 Key: LUCENE-1651
 URL: https://issues.apache.org/jira/browse/LUCENE-1651
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.9
Reporter: Earwin Burrfoot
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1651-tag.patch, LUCENE-1651.patch, 
 LUCENE-1651.patch


 As per discussion in mailing list, I'm making DirectoryIndexReader.open() 
 always return MSR, even for single-segment indexes.
 While theoretically valid in the past (if you make sure to keep your index 
 constantly optimized) this feature is made practically obsolete by 
 per-segment collection.
 The patch somewhat de-hairies (re-)open logic for MSR/SR.
 SR no longer needs an ability to pose as toplevel directory-owning IR.
 All related logic is moved from DIR to MSR.
 DIR becomes almost empty, and after copying the two or three remaining fields over to 
 MSR/SR, I remove it.
 Lots of tests fail, as they rely on the SR returned from IR.open(); I fix this by 
 introducing a static package-private SR.getOnlySegmentReader method.
 Some previous bugs are uncovered, one is fixed in LUCENE-1645, another 
 (partially fixed in LUCENE-1648) is fixed in this patch. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



IR static methods

2009-06-03 Thread Earwin Burrfoot
I have a strong desire to remove all these static methods from IR -
lastModified, getCurrentVersion, getCommitUserData, indexExists.
But haven't found a good place for them yet.

Directory - is a bad place, it shouldn't concern itself with details
of what exactly is stored inside, it should think of 'how' it is
stored.
IndexReader - is bad, it is too heavyweight to be created for getting
something simple once.

We should probably create some new lightweight class that provides a
kind of reflection for the index? Mod dates, versions, userdata,
existence, sizes, deletions, whatever. Both per-index and per-segment.
Essentially it is a wrapper over SegmentInfos that allows us to keep
them hidden (and thus easily changeable), and provides users with a more
concise and adequate interface.
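
Something like this, shape-wise (the name and methods are invented, just to show
where the existing IR statics could land):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;

public final class IndexMetadata {
  private final Directory dir;

  public IndexMetadata(Directory dir) { this.dir = dir; }

  public boolean exists() throws IOException { return IndexReader.indexExists(dir); }
  public long version() throws IOException { return IndexReader.getCurrentVersion(dir); }
  public long lastModified() throws IOException { return IndexReader.lastModified(dir); }
  // ... plus userdata, sizes, deletions, per-segment views over SegmentInfos
}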

Any thoughts?

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1672) Deprecate all String/File ctors/opens in IndexReader/IndexWriter/IndexSearcher

2009-06-03 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715944#action_12715944
 ] 

Earwin Burrfoot commented on LUCENE-1672:
-

bq. I will later try to solve this problem with the closeDir inside the 
different IndexReaders (but maybe Earwin has done it already in LUCENE-1651)
My issue removes closeDir from SegmentReader, as it cannot 'own' a directory 
anymore. MSR-to-be-DirectoryReader still has this flag.

 Deprecate all String/File ctors/opens in IndexReader/IndexWriter/IndexSearcher
 --

 Key: LUCENE-1672
 URL: https://issues.apache.org/jira/browse/LUCENE-1672
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.9
Reporter: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1672.patch, LUCENE-1672.patch


 During the investigation of LUCENE-1658, I found out that even LUCENE-1453 is 
 not completely fixed.
 As 1658 deprecates all FSDirectory.getDirectory() static factories, we should 
 not use them anymore. As the user is now free to choose the correct directory 
 implementation, using direct instantiation or FSDir.open(), he should no 
 longer use the ctors/methods in IndexWriter/IndexReader/IndexSearcher & Co. 
 that simply take path names as String or File, and should always instantiate the 
 Directory himself.
 LUCENE-1453 currently works for the cached directory implementations from 
 FSDir.getDirectory, but not with uncached, non-refcounting FSDirs. Sometimes 
 reopen() closes the directory (as far as I see, when a SegmentReader changes 
 to a MultiSegmentReader and/or deletes apply). This is hard to track. In 
 Lucene 3.0 we can then remove the whole bunch of closeDirectory 
 parameters/fields in these classes and simply not care anymore about 
 closing directories.
 To remove this closeDirectory parameter now (before 3.0) and also fix 1453 
 correctly, an additional idea would be to change these factories that take 
 a File/String to return the IndexReader wrapped by a FilterIndexReader 
 that keeps track of closing the underlying directory after close and reopen. 
 This is simpler than passing this boolean between different 
 DirectoryIndexReader instances. The small performance impact of wrapping with 
 FilterIndexReader should not be so bad, because the method is deprecated and 
 we can state that it is better to use the factory method with the Directory 
 parameter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1672) Deprecate all String/File ctors/opens in IndexReader/IndexWriter/IndexSearcher

2009-06-03 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715962#action_12715962
 ] 

Earwin Burrfoot commented on LUCENE-1672:
-

bq. And DirectoryIR/MSR still have this Flag, but reopening a MSR always 
returns a MSR again (even if it only consists of one segment)?
Exactly.

 Deprecate all String/File ctors/opens in IndexReader/IndexWriter/IndexSearcher
 --

 Key: LUCENE-1672
 URL: https://issues.apache.org/jira/browse/LUCENE-1672
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.9
Reporter: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1672.patch, LUCENE-1672.patch


 During the investigation of LUCENE-1658, I found out that even LUCENE-1453 is 
 not completely fixed.
 As 1658 deprecates all FSDirectory.getDirectory() static factories, we should 
 not use them anymore. As the user is now free to choose the correct directory 
 implementation, using direct instantiation or FSDir.open(), he should no 
 longer use the ctors/methods in IndexWriter/IndexReader/IndexSearcher & Co. 
 that simply take path names as String or File, and should always instantiate the 
 Directory himself.
 LUCENE-1453 currently works for the cached directory implementations from 
 FSDir.getDirectory, but not with uncached, non-refcounting FSDirs. Sometimes 
 reopen() closes the directory (as far as I see, when a SegmentReader changes 
 to a MultiSegmentReader and/or deletes apply). This is hard to track. In 
 Lucene 3.0 we can then remove the whole bunch of closeDirectory 
 parameters/fields in these classes and simply not care anymore about 
 closing directories.
 To remove this closeDirectory parameter now (before 3.0) and also fix 1453 
 correctly, an additional idea would be to change these factories that take 
 a File/String to return the IndexReader wrapped by a FilterIndexReader 
 that keeps track of closing the underlying directory after close and reopen. 
 This is simpler than passing this boolean between different 
 DirectoryIndexReader instances. The small performance impact of wrapping with 
 FilterIndexReader should not be so bad, because the method is deprecated and 
 we can state that it is better to use the factory method with the Directory 
 parameter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.

2009-06-03 Thread Earwin Burrfoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Earwin Burrfoot updated LUCENE-1651:


Attachment: LUCENE-1651-tag.patch
LUCENE-1651.patch

Argh! The rename broke test-tag again :) in new and innovative ways.
New patches attached.

 Make IndexReader.open() always return MSR to simplify (re-)opens.
 -

 Key: LUCENE-1651
 URL: https://issues.apache.org/jira/browse/LUCENE-1651
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.9
Reporter: Earwin Burrfoot
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1651-tag.patch, LUCENE-1651-tag.patch, 
 LUCENE-1651.patch, LUCENE-1651.patch, LUCENE-1651.patch


 As per discussion in mailing list, I'm making DirectoryIndexReader.open() 
 always return MSR, even for single-segment indexes.
 While theoretically valid in the past (if you make sure to keep your index 
 constantly optimized) this feature is made practically obsolete by 
 per-segment collection.
 The patch somewhat de-hairies (re-)open logic for MSR/SR.
 SR no longer needs an ability to pose as toplevel directory-owning IR.
 All related logic is moved from DIR to MSR.
 DIR becomes almost empty, and after copying the two or three remaining fields over to 
 MSR/SR, I remove it.
 Lots of tests fail, as they rely on the SR returned from IR.open(); I fix this by 
 introducing a static package-private SR.getOnlySegmentReader method.
 Some previous bugs are uncovered, one is fixed in LUCENE-1645, another 
 (partially fixed in LUCENE-1648) is fixed in this patch. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.

2009-06-03 Thread Earwin Burrfoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Earwin Burrfoot updated LUCENE-1651:


Attachment: LUCENE-1651.patch

One more version, applies against current trunk without fuzzy hunk matching.

 Make IndexReader.open() always return MSR to simplify (re-)opens.
 -

 Key: LUCENE-1651
 URL: https://issues.apache.org/jira/browse/LUCENE-1651
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.9
Reporter: Earwin Burrfoot
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1651-tag.patch, LUCENE-1651-tag.patch, 
 LUCENE-1651.patch, LUCENE-1651.patch, LUCENE-1651.patch, LUCENE-1651.patch


 As per discussion in mailing list, I'm making DirectoryIndexReader.open() 
 always return MSR, even for single-segment indexes.
 While theoretically valid in the past (if you make sure to keep your index 
 constantly optimized) this feature is made practically obsolete by 
 per-segment collection.
 The patch somewhat de-hairies (re-)open logic for MSR/SR.
 SR no longer needs an ability to pose as toplevel directory-owning IR.
 All related logic is moved from DIR to MSR.
 DIR becomes almost empty, and after copying the two or three remaining fields over to 
 MSR/SR, I remove it.
 Lots of tests fail, as they rely on the SR returned from IR.open(); I fix this by 
 introducing a static package-private SR.getOnlySegmentReader method.
 Some previous bugs are uncovered, one is fixed in LUCENE-1645, another 
 (partially fixed in LUCENE-1648) is fixed in this patch. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Enhance StandardTokenizer to support words which will not be tokenized

2009-06-03 Thread Earwin Burrfoot
Not sure you can easily marry a generated JFlex grammar and a
runtime-provided list of protected words.
I took the approach of creating tokens for punctuation inside my
tokenizer and later gluing them with nearby text tokens or dropping
them from the stream with a tokenfilter.
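
Very roughly, the gluing filter looks like this (Lucene 2.9 attribute API; the
class name and the one-token-lookahead scheme are mine, and offsets are left
unhandled to keep the sketch short):

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class ProtectedWordJoinFilter extends TokenFilter {
  private final Set<String> protectedWords; // e.g. {"c++", "c#", ".net"}
  private final TermAttribute termAtt;
  private State pending;                    // buffered lookahead token

  public ProtectedWordJoinFilter(TokenStream input, Set<String> protectedWords) {
    super(input);
    this.protectedWords = protectedWords;
    this.termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    if (pending != null) {            // emit the buffered lookahead token first
      restoreState(pending);
      pending = null;
    } else if (!input.incrementToken()) {
      return false;
    }
    String current = termAtt.term();
    State currentState = captureState();
    if (input.incrementToken()) {     // peek at the next token
      String joined = current + termAtt.term();
      if (protectedWords.contains(joined)) {
        termAtt.setTermBuffer(joined);   // glue, e.g. "c" + "++" -> "c++"
      } else {
        pending = captureState();        // keep the lookahead for the next call
        restoreState(currentState);      // and emit the current token unchanged
      }
    } else {
      restoreState(currentState);        // last token in the stream
    }
    return true;
  }
}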

On Wed, Jun 3, 2009 at 20:10, Grant Ingersoll gsing...@apache.org wrote:
 You'd have to modify the JFlex grammar.  I'd suggest adding in a generic
 protected words approach whereby you can pass in a list of protected
 words.

 This would be a nice patch/improvement.

 -Grant

 On Jun 3, 2009, at 4:07 AM, ami dudu wrote:


 Hi, I'm using a StandardTokenizer which does a great job for me, but I need to
 enhance it somehow to consider words like c++, c#, .net as-is and not
 tokenize them into c or net.
 I know that there are other tokenizers such as KeywordTokenizer and
 WhitespaceTokenizer, but they do not include the StandardTokenizer logic.
 Any ideas on what is the best way to add this enhancement?

 Thanks,
 Amid
 --
 View this message in context:
 http://www.nabble.com/Enhance-StandardTokenizer-to-support-words-which-will-not-be-tokenized-tp23849495p23849495.html
 Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org


 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
 Solr/Lucene:
 http://www.lucidimagination.com/search


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-06-03 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715973#action_12715973
 ] 

Earwin Burrfoot commented on LUCENE-1630:
-

Searcher is supposed to be a little cherry of user-friendliness atop a glass of 
Lucene's murky internals, ain't it?
I mean, even you had to have the ways of Query, Weight and Scorer explained to you; what 
would a Lucene neophyte do if we removed his beloved convenience methods?

 Mating Collector and Scorer on doc Id orderness
 ---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 This is a spin off of LUCENE-1593. This issue proposes to expose appropriate 
 API on Scorer and Collector such that one can create an optimized Collector 
 based on a given Scorer's doc-id orderness and vice versa. Copied from 
 LUCENE-1593, here is the list of changes:
 # Deprecate Weight and create QueryWeight (abstract class) with a new 
 scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) 
 method. QueryWeight implements Weight, while score(reader) calls 
 score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) 
 is defined abstract.
 #* Also add QueryWeightWrapper to wrap a given Weight implementation. This 
 one will also be deprecated, as well as package-private.
 #* Add to Query variants of createWeight and weight which return QueryWeight. 
 For now, I prefer to add a default impl which wraps the Weight variant 
 instead of overriding in all Query extensions, and in 3.0 when we remove the 
 Weight variants - override in all extending classes.
 # Add to Scorer isOutOfOrder with a default to false, and override in BS to 
 true.
 # Modify BooleanWeight to extend QueryWeight and implement the new scorer 
 method to return BS2 or BS based on the number of required scorers and 
 setAllowOutOfOrder.
 # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns 
 true/false.
 #* Use it in IndexSearcher.search methods, that accept a Collector, in order 
 to create the appropriate Scorer, using the new QueryWeight.
 #* Provide a static create method to TFC and TSDC which accept this as an 
 argument and creates the proper instance.
 #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order 
 Scorer and check on the resulting Scorer isOutOfOrder(), so that we can 
 create the optimized Collector instance.
 # Modify IndexSearcher to use all of the above logic.
 The only class I'm worried about, and would like to verify with you, is 
 Searchable. If we want to deprecate all the search methods on IndexSearcher, 
 Searcher and Searchable which accept Weight and add new ones which accept 
 QueryWeight, we must do the following:
 * Deprecate Searchable in favor of Searcher.
 * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) 
 break back-compat and add them as abstract (like we've done with the new 
 Collector method) or (2) add them with a default impl to call the Weight 
 versions, documenting these will become abstract in 3.0.
 * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend 
 Searcher. That's the part I'm a little bit worried about - Searchable 
 implements java.rmi.Remote, which means there could be an implementation out 
 there which implements Searchable and extends something different than 
 UnicastRemoteObject, like Activeable. I think there is very small chance this 
 has actually happened, but would like to confirm with you guys first.
 * Add a deprecated, package-private, SearchableWrapper which extends Searcher 
 and delegates all calls to the Searchable member.
 * Deprecate all uses of Searchable and add Searcher instead, defaulting the 
 old ones to use SearchableWrapper.
 * Make all the necessary changes to IndexSearcher, MultiSearcher etc. 
 regarding overriding these new methods.
 One other optimization that was discussed in LUCENE-1593 is to expose a 
 topScorer() API (on Weight) which returns a Scorer that its score(Collector) 
 will be called, and additionally add a start() method to DISI. That will 
 allow Scorers to initialize either on start() or score(Collector). This was 
 proposed mainly because of BS and BS2 which check if they are initialized in 
 every call to next(), skipTo() and score(). Personally I prefer to see that 
 in a separate issue, following that one (as it might add methods to 
 QueryWeight).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Created: (LUCENE-1677) Remove GCJ IndexReader specializations

2009-06-03 Thread Earwin Burrfoot (JIRA)
Remove GCJ IndexReader specializations
--

 Key: LUCENE-1677
 URL: https://issues.apache.org/jira/browse/LUCENE-1677
 Project: Lucene - Java
  Issue Type: Task
Reporter: Earwin Burrfoot
 Fix For: 2.9


These specializations are outdated, unsupported, and most probably pointless due to 
the speed of modern JVMs and, I bet, nobody uses them (Mike, you said you were 
going to ask people on java-user - did anybody reply that they need it?). While 
giving nothing, they make the SegmentReader instantiation code look real ugly.

If nobody objects, I'm going to post a patch that removes these from Lucene.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-06-02 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715509#action_12715509
 ] 

Earwin Burrfoot commented on LUCENE-1630:
-

You can't, because Weights produced from the same Query are different for different 
indexes.
You can probably modify the Query in place for a given index, produce some scorers, 
do scoring, then modify the Query for another index, produce scorers, etc.
But now your Query is no longer thread-safe, and I can't reuse it from 
different threads.

So, with all its strange looks, the trio of Q, W, S is still the best approach if 
you ask me.
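
For reference, the shape being defended (2.9-era calls; searcher1/searcher2 here 
just stand in for two different indexes):

{code}
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class QWSDemo {
  // One shared, immutable Query; a Weight per index (index-specific stats get
  // baked into the Weight, not the Query); then Scorers per reader.
  static void demo(Searcher searcher1, Searcher searcher2) throws IOException {
    Query q = new TermQuery(new Term("body", "lucene"));
    Weight w1 = q.weight(searcher1);  // different stats for different indexes
    Weight w2 = q.weight(searcher2);
    // q stays thread-safe and reusable across threads; w1 and w2 do not
  }
}
{code}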

 Mating Collector and Scorer on doc Id orderness
 ---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 This is a spin off of LUCENE-1593. This issue proposes to expose appropriate 
 API on Scorer and Collector such that one can create an optimized Collector 
 based on a given Scorer's doc-id orderness and vice versa. Copied from 
 LUCENE-1593, here is the list of changes:
 # Deprecate Weight and create QueryWeight (abstract class) with a new 
 scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) 
 method. QueryWeight implements Weight, while score(reader) calls 
 score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) 
 is defined abstract.
 #* Also add QueryWeightWrapper to wrap a given Weight implementation. This 
 one will also be deprecated, as well as package-private.
 #* Add to Query variants of createWeight and weight which return QueryWeight. 
 For now, I prefer to add a default impl which wraps the Weight variant 
 instead of overriding in all Query extensions, and in 3.0 when we remove the 
 Weight variants - override in all extending classes.
 # Add to Scorer isOutOfOrder with a default to false, and override in BS to 
 true.
 # Modify BooleanWeight to extend QueryWeight and implement the new scorer 
 method to return BS2 or BS based on the number of required scorers and 
 setAllowOutOfOrder.
 # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns 
 true/false.
 #* Use it in IndexSearcher.search methods, that accept a Collector, in order 
 to create the appropriate Scorer, using the new QueryWeight.
 #* Provide a static create method to TFC and TSDC which accept this as an 
 argument and creates the proper instance.
 #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order 
 Scorer and check on the resulting Scorer isOutOfOrder(), so that we can 
 create the optimized Collector instance.
 # Modify IndexSearcher to use all of the above logic.
 The only class I'm worried about, and would like to verify with you, is 
 Searchable. If we want to deprecate all the search methods on IndexSearcher, 
 Searcher and Searchable which accept Weight and add new ones which accept 
 QueryWeight, we must do the following:
 * Deprecate Searchable in favor of Searcher.
 * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) 
 break back-compat and add them as abstract (like we've done with the new 
 Collector method) or (2) add them with a default impl to call the Weight 
 versions, documenting these will become abstract in 3.0.
 * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend 
 Searcher. That's the part I'm a little bit worried about - Searchable 
 implements java.rmi.Remote, which means there could be an implementation out 
 there which implements Searchable and extends something different than 
 UnicastRemoteObject, like Activeable. I think there is very small chance this 
 has actually happened, but would like to confirm with you guys first.
 * Add a deprecated, package-private, SearchableWrapper which extends Searcher 
 and delegates all calls to the Searchable member.
 * Deprecate all uses of Searchable and add Searcher instead, defaulting the 
 old ones to use SearchableWrapper.
 * Make all the necessary changes to IndexSearcher, MultiSearcher etc. 
 regarding overriding these new methods.
 One other optimization that was discussed in LUCENE-1593 is to expose a 
 topScorer() API (on Weight) which returns a Scorer that its score(Collector) 
 will be called, and additionally add a start() method to DISI. That will 
 allow Scorers to initialize either on start() or score(Collector). This was 
 proposed mainly because of BS and BS2 which check if they are initialized in 
 every call to next(), skipTo() and score(). Personally I prefer to see that 
 in a separate issue, following that one (as it might add methods to 
 QueryWeight).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.

2009-06-02 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715672#action_12715672
 ] 

Earwin Burrfoot commented on LUCENE-1651:
-

Hm.. okay, I've got back to work on this patch. To fix the tests relying on getting 
an SR from IR.open() on trunk, I introduced a package-private utility method that 
extracts the SR from an MSR if it is the only one there. The tests in tags/XXX don't 
see this method; should I backport it somewhere there?

 Make IndexReader.open() always return MSR to simplify (re-)opens.
 -

 Key: LUCENE-1651
 URL: https://issues.apache.org/jira/browse/LUCENE-1651
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.9
Reporter: Earwin Burrfoot
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1651.patch


 As per discussion in mailing list, I'm making DirectoryIndexReader.open() 
 always return MSR, even for single-segment indexes.
 While theoretically valid in the past (if you make sure to keep your index 
 constantly optimized) this feature is made practically obsolete by 
 per-segment collection.
 The patch somewhat de-hairies (re-)open logic for MSR/SR.
 SR no longer needs an ability to pose as toplevel directory-owning IR.
 All related logic is moved from DIR to MSR.
 DIR becomes almost empty, and after copying the two or three remaining fields over to 
 MSR/SR, I remove it.
 Lots of tests fail, as they rely on the SR returned from IR.open(); I fix this by 
 introducing a static package-private SR.getOnlySegmentReader method.
 Some previous bugs are uncovered, one is fixed in LUCENE-1645, another 
 (partially fixed in LUCENE-1648) is fixed in this patch. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.

2009-06-02 Thread Earwin Burrfoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Earwin Burrfoot updated LUCENE-1651:


Attachment: LUCENE-1651-tag.patch
LUCENE-1651.patch

Here are the patches for the current Lucene trunk and the back-compat tag.

 Make IndexReader.open() always return MSR to simplify (re-)opens.
 -

 Key: LUCENE-1651
 URL: https://issues.apache.org/jira/browse/LUCENE-1651
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.9
Reporter: Earwin Burrfoot
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1651-tag.patch, LUCENE-1651.patch, 
 LUCENE-1651.patch


 As per discussion in mailing list, I'm making DirectoryIndexReader.open() 
 always return MSR, even for single-segment indexes.
 While theoretically valid in the past (if you make sure to keep your index 
 constantly optimized) this feature is made practically obsolete by 
 per-segment collection.
 The patch somewhat de-hairies (re-)open logic for MSR/SR.
 SR no longer needs an ability to pose as toplevel directory-owning IR.
 All related logic is moved from DIR to MSR.
 DIR becomes almost empty, and after copying the two or three remaining fields over to 
 MSR/SR, I remove it.
 Lots of tests fail, as they rely on the SR returned from IR.open(); I fix this by 
 introducing a static package-private SR.getOnlySegmentReader method.
 Some previous bugs are uncovered, one is fixed in LUCENE-1645, another 
 (partially fixed in LUCENE-1648) is fixed in this patch. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory

2009-06-01 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715008#action_12715008
 ] 

Earwin Burrfoot commented on LUCENE-1658:
-

I told you, Java mmap doesn't work on Windows.
And please, don't use the unmap hack! If it doesn't work, it doesn't work. 
Let's use SimpleFSD for all Windows versions.
Look, what are you going to do if you unmap a buffer and then access it by 
accident? Crash the JVM?

 Absorb NIOFSDirectory into FSDirectory
 --

 Key: LUCENE-1658
 URL: https://issues.apache.org/jira/browse/LUCENE-1658
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Michael McCandless
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, 
 LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, 
 LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, 
 LUCENE-1658.patch


 I think whether one uses java.io.* vs java.nio.* or eventually
 java.nio2.*, or some other means, is an under-the-hood implementation
 detail of FSDirectory and doesn't merit a whole separate class.
 I think FSDirectory should be the core class one uses when one's index
 is in the filesystem.
 So, I'd like to deprecate NIOFSDirectory, absorbing it into
 FSDirectory, and add a setting useNIO to FSDirectory.  It should
 default to true for non-Windows OSs, because it gives far better
 concurrent performance on all platforms but Windows (due to known Sun
 JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory

2009-06-01 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715016#action_12715016
 ] 

Earwin Burrfoot edited comment on LUCENE-1658 at 6/1/09 1:14 AM:
-

bq. The buffer is nulled directly after unmapping. 
Really? Let me quote some code (MacOS, Java 1.6):

{code}
unsafe.freeMemory(address);
address = 0;
Bits.unreserveMemory(capacity);
{code}

Does the Windows version differ? What we see here is 'zeroing', not 'nulling'. When 
doing get/set, the buffer never checks whether the address makes sense, so the next 
access will yield a GPF :)
The guys from Sun explained the absence of unmap() in the original design - the 
only way of closing a mapped buffer without getting unpredictable behaviour is to 
introduce a synchronized isClosed check on each read/write operation, which 
kills the performance even if the sync method used is just a volatile variable.

  was (Author: earwin):
Really? Let me quote some code (MacOS, Java 1.6):

{code}
unsafe.freeMemory(address);
address = 0;
Bits.unreserveMemory(capacity);
{code}

Does the Windows version differ? What we see here is 'zeroing', not 'nulling'. When 
doing get/set, the buffer never checks whether the address makes sense, so the next 
access will yield a GPF :)
The guys from Sun explained the absence of unmap() in the original design - the 
only way of closing a mapped buffer without getting unpredictable behaviour is to 
introduce a synchronized isClosed check on each read/write operation, which 
kills the performance even if the sync method used is just a volatile variable.
  
 Absorb NIOFSDirectory into FSDirectory
 --

 Key: LUCENE-1658
 URL: https://issues.apache.org/jira/browse/LUCENE-1658
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Michael McCandless
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, 
 LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, 
 LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, 
 LUCENE-1658.patch, LUCENE-1658.patch


 I think whether one uses java.io.* vs java.nio.* or eventually
 java.nio2.*, or some other means, is an under-the-hood implementation
 detail of FSDirectory and doesn't merit a whole separate class.
 I think FSDirectory should be the core class one uses when one's index
 is in the filesystem.
 So, I'd like to deprecate NIOFSDirectory, absorbing it into
 FSDirectory, and add a setting useNIO to FSDirectory.  It should
 default to true for non-Windows OSs, because it gives far better
 concurrent performance on all platforms but Windows (due to known Sun
 JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory

2009-06-01 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715016#action_12715016
 ] 

Earwin Burrfoot commented on LUCENE-1658:
-

Really? Let me quote some code (MacOS, Java 1.6):

{code}
unsafe.freeMemory(address);
address = 0;
Bits.unreserveMemory(capacity);
{code}

Does the Windows version differ? What we see here is 'zeroing', not 'nulling'. When 
doing get/set, the buffer never checks whether the address makes sense, so the next 
access will yield a GPF :)
The guys from Sun explained the absence of unmap() in the original design - the 
only way of closing a mapped buffer without getting unpredictable behaviour is to 
introduce a synchronized isClosed check on each read/write operation, which 
kills the performance even if the sync method used is just a volatile variable.

 Absorb NIOFSDirectory into FSDirectory
 --

 Key: LUCENE-1658
 URL: https://issues.apache.org/jira/browse/LUCENE-1658
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Michael McCandless
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, 
 LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, 
 LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, 
 LUCENE-1658.patch, LUCENE-1658.patch


 I think whether one uses java.io.* vs java.nio.* or eventually
 java.nio2.*, or some other means, is an under-the-hood implementation
 detail of FSDirectory and doesn't merit a whole separate class.
 I think FSDirectory should be the core class one uses when one's index
 is in the filesystem.
 So, I'd like to deprecate NIOFSDirectory, absorbing it into
 FSDirectory, and add a setting useNIO to FSDirectory.  It should
 default to true for non-Windows OSs, because it gives far better
 concurrent performance on all platforms but Windows (due to known Sun
 JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory

2009-06-01 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715018#action_12715018
 ] 

Earwin Burrfoot commented on LUCENE-1658:
-

Ah! You were referring to your code. It's still not thread-safe: someone could 
access the closed buffer before they see the now-null reference to it.
You also employ the hack on non-Windows machines, which work quite well without 
it. What for?




[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory

2009-06-01 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715026#action_12715026
 ] 

Earwin Burrfoot commented on LUCENE-1658:
-

I tested on Mac OS:

Invalid memory access of location 8b55a000 rip=0110c367

Here the JVM quietly dies: non-zero return code, all threads killed, no 
diagnostic files created.




[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory

2009-06-01 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715027#action_12715027
 ] 

Earwin Burrfoot commented on LUCENE-1658:
-

bq. It uses less virtual memory :)
64-bit systems have an abundance of that valuable resource. Why taint them with 
a dangerous hack for zero returns?




[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory

2009-06-01 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715057#action_12715057
 ] 

Earwin Burrfoot commented on LUCENE-1658:
-

bq. I'm a bit nervous about creating MMapDirectory automatically for any OS, 
not just Windows.
It's almost okay for 64-bit systems.

bq. The hack also saves transient disk space, on all systems, right?
That's a nice catch. That explains some of the non-buggy-but-weird behaviour my 
app exhibits.

bq. But they have a 64 bit buffer, so you could use it instead of many buffers.
They don't. When the NIO2 project was merged into OpenJDK, some pieces were 
left unmerged, including 64-bit buffers. They are currently absent from OpenJDK 
and the Java 7 preview builds, and there is not even a rough estimate of 
whether, or when, they will make it in.

bq. Maybe we should move this hack to contrib (a class that extends 
MMapDirectory by adding a close method) with a big warning!
I support this. The hack has some merit if carefully applied, but is far too 
dangerous to ship as the default.
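
For reference, the hack in question is typically implemented along these lines. 
This is a sketch only, not Lucene's actual code: it reaches the non-portable 
sun.misc internals reflectively, and it is unsafe if any thread is still 
reading the buffer:

import java.nio.MappedByteBuffer;

// Best-effort forced unmap via the Sun-internal Cleaner, invoked
// reflectively so the code still compiles on non-Sun JVMs.
final class Unmapper {
    static void unmap(final MappedByteBuffer buffer) {
        java.security.AccessController.doPrivileged(
            new java.security.PrivilegedAction<Object>() {
                public Object run() {
                    try {
                        java.lang.reflect.Method getCleaner =
                            buffer.getClass().getMethod("cleaner");
                        getCleaner.setAccessible(true);
                        Object cleaner = getCleaner.invoke(buffer);
                        if (cleaner != null) {
                            cleaner.getClass().getMethod("clean").invoke(cleaner);
                        }
                    } catch (Exception e) {
                        // unmap is best-effort; fall back to waiting for GC
                    }
                    return null;
                }
            });
    }
}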




[jira] Issue Comment Edited: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory

2009-06-01 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715063#action_12715063
 ] 

Earwin Burrfoot edited comment on LUCENE-1658 at 6/1/09 4:16 AM:
-

bq. On a couple of projects I've worked on, they were very reluctant to have 
packages allocate memory outside the JVM, and that's my understanding of 
memory-mapped buffers.
mmap does not allocate memory. It allocates address space, and uses the same 
disk cache the system already has.
For example, you can't cause an OOM in your (or a co-existing) app with mmaps 
(except by eating up your own address space on 32-bit systems).

bq. But if you decide to include MMapDir in that auto-create logic, I hope 
there will be a way to instantiate a specific FSDir, in case we'll have 
problems with that logic.
Public constructors for all FSDir variants are a must, and for me they are the 
best thing this patch has to offer :)
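
For example, picking a concrete variant explicitly, bypassing the platform 
auto-selection (a sketch assuming the constructor shapes this patch proposes, 
not a final API):

import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;
import org.apache.lucene.store.SimpleFSDirectory;

public class PickDirectory {
    public static void main(String[] args) throws Exception {
        File path = new File("/path/to/index");         // hypothetical location
        Directory mmap   = new MMapDirectory(path);     // mapped I/O
        Directory nio    = new NIOFSDirectory(path);    // positional reads, slow on Windows
        Directory simple = new SimpleFSDirectory(path); // classic java.io fallback
        System.out.println(mmap + " / " + nio + " / " + simple);
    }
}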

  was (Author: earwin):
bq. On a couple of projects I've worked on, they were very reluctant to 
have packages allocate memory outside the JVM, and that's my understanding of 
memory-mapped buffers.
mmap does not allocate memory. It allocates address space, and uses the same 
disk cache the system already has.
For example, you can't cause an OOM in your (or a co-existing) app with 
mmaps (except by eating up your own address space on 32-bit systems).
  



[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory

2009-06-01 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715063#action_12715063
 ] 

Earwin Burrfoot commented on LUCENE-1658:
-

bq. On a couple of projects I've worked on, they were very reluctant to have 
packages allocate memory outside the JVM, and that's my understanding of 
memory-mapped buffers.
mmap does not allocate memory. It allocates address space, and uses the same 
disk cache the system already has.
For example, you can't cause an OOM in your (or a co-existing) app with mmaps 
(except by eating up your own address space on 32-bit systems).
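
A tiny demo of the point (the file name is hypothetical, and the file is 
assumed to be non-empty): mapping a large file consumes address space, not 
Java heap, and pages are faulted in through the OS cache.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile("big.index", "r");
        try {
            FileChannel ch = raf.getChannel();
            // A ByteBuffer is int-indexed, so one mapping is capped at 2GB;
            // no heap is allocated for the file contents themselves.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0,
                                          Math.min(ch.size(), Integer.MAX_VALUE));
            System.out.println("first byte: " + buf.get(0));
        } finally {
            raf.close();
        }
    }
}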



