[jira] Created: (LUCENE-1572) luceneweb

2009-03-26 Thread kysnail (JIRA)
luceneweb 
--

 Key: LUCENE-1572
 URL: https://issues.apache.org/jira/browse/LUCENE-1572
 Project: Lucene - Java
  Issue Type: Bug
  Components: Examples
Affects Versions: 2.4
 Environment: Windows XP
Reporter: kysnail
Priority: Minor


Lucene version: lucene-2.4.0
Following the reference documentation, luceneweb does not run correctly.
With lucene-1.4.3, however, it works properly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Michael McCandless
Mark Miller markrmil...@gmail.com wrote:
 bq. I personally don't understand why MRHC was invented in the first place.

 The evolution of MRHC is in the comments of LUCENE-1483 - a lot of comments
 to wade through though.

MRHC was created because simply adding setNextReader to HC would break
back compat, because collect(...) is called on the un-rebased doc.  Ie
we need a new class so we can tell that it will handle re-basing the
doc itself.

Mike




Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Michael McCandless
Shai Erera ser...@gmail.com wrote:

 The difference is for the new code, it's an upcast, which catches any
 errors at compile time, not run time.  The compiler determines that
 the class implements the required interface.

 I still don't understand how a compiler can detect at compilation time that
 a HitCollector instance that is used in method A, and is cast to a
 TopDocsOutput instance by calling method B (from A) is actually ok ... I may
 be missing some Java information here, but I simply don't see how that
 happens at compilation time instead of at runtime ...

I may have lost the context here... but here's what I thought we were
talking about.

If we choose the interface option (adding a ProvidesTopDocsResults
(say) interface), then you would create a method
renderResults(ProvidesTopDocsResults).

Then, any collector implementing that interface could be safely passed
in, as the upcast is done at compile time not runtime.
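
A minimal sketch of the compile-time upcast described above. Every type here (ProvidesTopDocsResults, SketchHitCollector, SketchTopCollector) is a hypothetical stand-in invented for illustration, not a Lucene class. It shows that passing a concrete collector to renderResults needs no cast, while starting from the base collector type forces an explicit cast that is only checked at run time.

```java
import java.util.ArrayList;
import java.util.List;

// All names here are hypothetical stand-ins; none are real Lucene classes.
interface ProvidesTopDocsResults {
    List<Integer> topDocIds();
}

abstract class SketchHitCollector {
    abstract void collect(int doc, float score);
}

// A collector that is also a provider of results.
final class SketchTopCollector extends SketchHitCollector implements ProvidesTopDocsResults {
    private final List<Integer> ids = new ArrayList<>();

    @Override
    void collect(int doc, float score) {
        ids.add(doc);
    }

    @Override
    public List<Integer> topDocIds() {
        return ids;
    }
}

final class UpcastDemo {
    // Accepts anything that implements the interface.
    static String renderResults(ProvidesTopDocsResults results) {
        return "docs=" + results.topDocIds();
    }

    public static void main(String[] args) {
        SketchTopCollector collector = new SketchTopCollector();
        collector.collect(3, 1.0f);
        collector.collect(7, 0.5f);

        // Implicit upcast: the compiler verifies that SketchTopCollector
        // implements ProvidesTopDocsResults.
        System.out.println(renderResults(collector));

        // With only the base type in hand, an explicit cast is required,
        // and that cast is checked at run time rather than compile time.
        SketchHitCollector asBase = collector;
        System.out.println(renderResults((ProvidesTopDocsResults) asBase));
    }
}
```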

 So in fact the internal Lucene code expects only MRHC from a certain point,
 and so even if I wrote a HC and passed it on Searcher, it's still converted
 to MRHC, with an empty setNextReader() method implementation. That's why I
 meant that HC is already deprecated, whether it's marked as deprecated or
 not.

The setNextReader() impl is not empty; it does the re-basing of docID
on behalf of the HC.
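
As a rough illustration of that re-basing, here is a sketch with simplified, hypothetical stand-ins for HC and MRHC (LegacyCollector and PerReaderCollector are invented names, and setNextReader is reduced to a single docBase argument, which is an assumption rather than the real signature). The wrapper remembers the current reader's base and adds it to each per-reader doc ID before handing the hit to the old-style collector.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for HC and MRHC; not the real Lucene signatures.
abstract class LegacyCollector {
    // Expects doc IDs relative to the whole index.
    abstract void collect(int doc, float score);
}

abstract class PerReaderCollector {
    // Called when the search moves on to the next sub-reader.
    abstract void setNextReader(int docBase);

    // Receives doc IDs relative to the current sub-reader.
    abstract void collect(int doc, float score);
}

// Wraps an old-style collector and re-bases doc IDs on its behalf.
final class RebasingWrapper extends PerReaderCollector {
    private final LegacyCollector delegate;
    private int docBase;

    RebasingWrapper(LegacyCollector delegate) {
        this.delegate = delegate;
    }

    @Override
    void setNextReader(int docBase) {
        this.docBase = docBase;
    }

    @Override
    void collect(int doc, float score) {
        delegate.collect(docBase + doc, score);
    }
}

// Old-style collector that simply records the doc IDs it is given.
final class RecordingCollector extends LegacyCollector {
    final List<Integer> seen = new ArrayList<>();

    @Override
    void collect(int doc, float score) {
        seen.add(doc);
    }
}

final class RebaseDemo {
    public static void main(String[] args) {
        RecordingCollector legacy = new RecordingCollector();
        PerReaderCollector wrapped = new RebasingWrapper(legacy);
        wrapped.setNextReader(0); // first sub-reader starts at doc 0
        wrapped.collect(2, 1.0f);
        wrapped.setNextReader(5); // second sub-reader starts at doc 5
        wrapped.collect(1, 0.8f);
        System.out.println(legacy.seen);
    }
}
```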

 What you say about deprecating HC to me is unnecessary. Simply pull
 setNextReader up with an empty implementation, get rid of all the
 instanceof, casting and wrapping code and you're fine. Nothing is broken.
 All works well and better (instanceof, casting and wrapping have their
 overheads).
 Isn't that right?

I think we need to deprecate HC, in favor of MRHC (or if we can think
of a better name... ResultCollector?).

 Regarding interfaces .. I don't think they're that bad. Perhaps a different
 viewing angle might help. Lucene processes a query and fires events. Today
 it fires an event every time a doc's score has been computed, and recently
 every time it starts to process a different reader. HitCollector is a
 listener implementation on the doc-score event, while MRHC is a listener
 on both.
 To me, interfaces play just nicely here. Assume that you have the following
 interfaces:
 - DocScoreEvent
 - ChangeReaderEvent
 - EndProcessingEvent (thrown when the query has finished processing -
 perhaps to aid collectors to free resources)
 - any other events you foresee?
 The Lucene code receives a HitCollector which listens on all events. In the
 future, Lucene might throw other events, but still get a HitCollector. Those
 methods will check for instanceof, and you as a user will know that if you
 want to catch those events, you pass in a collector implementation which
 does. Those events cannot of course be main-stream events (like
 DocScoreEvent), but new ones, perhaps even experimental.
 Since HitCollector is a concrete class, we can always add new interfaces to
 it in the future with empty implementations?

I agree interfaces clearly have useful properties, but their Achilles'
heel for Lucene, in all but the most trivial cases, is the loss of
back-compatibility when you want to add a method to the interface.
There have been a number of java-dev discussions on this problem.

So, I think something like this:

  * Deprecate HitCollector in favor of a MultiReaderHitCollector (any
    better name here?) abstract class.  If you want to make a fully
    custom HitCollector, you subclass this class.

    Let's change MRHC's collect to take only an [unbased] docID, and
    expose a setScorer(Scorer) method.  Then if the collector needs
    the score it can call Scorer.score().

  * Subclass that to create an abstract "tracks top N results"
    collector (TopDocsCollector?  TopHitsCollector?)

  * Subclass TopDocsCollector to a final, fast "top N sorted by score"
    collector (exists already: TopScoreDocCollector)

  * Subclass TopDocsCollector to a final, fast "top N sorted by field"
    collector (exists already: TopFieldCollector)

  * Subclass TopDocsCollector to a "you provide your own pqueue and we
    collect top N according to it" collector (does not yet exist --
    name?).  This is the way forward for existing subclasses of
    TopDocCollector.
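
The hierarchy outlined in the list above might look roughly like the following. This is only a sketch under stated assumptions: all class names (BaseCollectorSketch, TopNCollectorSketch, TopScoreCollectorSketch, ScoreSource, Hit) are made up for illustration, setNextReader is reduced to a docBase argument, and java.util.PriorityQueue stands in for whatever queue a real patch would use.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-ins; names and signatures are assumptions, not Lucene's API.
interface ScoreSource {
    float score();
}

final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

// Base class: collect() receives only the per-reader doc ID.
abstract class BaseCollectorSketch {
    protected ScoreSource scorer;
    protected int docBase;

    void setScorer(ScoreSource scorer) { this.scorer = scorer; }
    void setNextReader(int docBase) { this.docBase = docBase; }
    abstract void collect(int doc);
}

// Abstract "tracks top N results" collector; the subclass supplies the ordering.
abstract class TopNCollectorSketch extends BaseCollectorSketch {
    private final int n;
    private final PriorityQueue<Hit> queue; // head is the current worst hit

    TopNCollectorSketch(int n, Comparator<Hit> worstFirst) {
        this.n = n;
        this.queue = new PriorityQueue<>(worstFirst);
    }

    @Override
    void collect(int doc) {
        // The score is pulled from the scorer only because this collector needs it.
        queue.add(new Hit(docBase + doc, scorer.score()));
        if (queue.size() > n) {
            queue.poll(); // drop the worst hit
        }
    }

    // Returns the kept hits, best first, without disturbing the queue.
    List<Hit> topHits() {
        PriorityQueue<Hit> copy = new PriorityQueue<>(queue);
        Hit[] out = new Hit[copy.size()];
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = copy.poll();
        }
        return Arrays.asList(out);
    }
}

// Final "top N sorted by score" collector.
final class TopScoreCollectorSketch extends TopNCollectorSketch {
    TopScoreCollectorSketch(int n) {
        super(n, Comparator.comparingDouble((Hit h) -> h.score));
    }
}

final class HierarchyDemo {
    public static void main(String[] args) {
        TopScoreCollectorSketch collector = new TopScoreCollectorSketch(2);
        float[] current = new float[1];
        collector.setScorer(() -> current[0]);
        collector.setNextReader(10);
        float[] scores = {0.1f, 0.9f, 0.5f};
        for (int doc = 0; doc < scores.length; doc++) {
            current[0] = scores[doc];
            collector.collect(doc);
        }
        for (Hit hit : collector.topHits()) {
            System.out.println(hit.doc + " " + hit.score);
        }
    }
}
```

A collector that only counts hits would extend the base class directly and never call score(), which is the point of passing the scorer separately.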

Shai do you want to take a first cut at making a patch?  Can you open
an issue?  Thanks.

Mike




[jira] Updated: (LUCENE-1572) luceneweb

2009-03-26 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1572:
---

Fix Version/s: 2.9




Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread DM Smith


On Mar 26, 2009, at 6:55 AM, Michael McCandless wrote:


 I think we need to deprecate HC, in favor of MRHC (or if we can think
 of a better name... ResultCollector?).


I like your suggestion for the name.





Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Shai Erera
You're right ... I missed that. My fault :)

On Thu, Mar 26, 2009 at 12:18 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Mark Miller markrmil...@gmail.com wrote:
  bq. I personally don't understand why MRHC was invented in the first
 place.
 
  The evolution of MRHC is in the comments of LUCENE-1483 - a lot of
 comments
  to wade through though.

 MRHC was created because simply adding setNextReader to HC would break
 back compat, because collect(...) is called on the un-rebased doc.  Ie
 we need a new class so we can tell that it will handle re-basing the
 doc itself.

 Mike





Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Earwin Burrfoot
I'd say it is a bad name. A raw hit is a long way from being the result of a search.

If you're already breaking back compat with the 3.0 release (by
incrementing the Java version), maybe it's worth breaking it in a few more
places, just so ugly names like MRHC and special code paths that check
for n-year-old interfaces won't haunt us for the next century.

On Thu, Mar 26, 2009 at 14:15, DM Smith dmsmith...@gmail.com wrote:

 On Mar 26, 2009, at 6:55 AM, Michael McCandless wrote:

  I think we need to deprecate HC, in favor of MRHC (or if we can think
  of a better name... ResultCollector?).

 I like your suggestion for the name.







-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




[jira] Created: (LUCENE-1573) IndexWriter does not do the right thing when a Thread is interrupt()'d

2009-03-26 Thread Michael McCandless (JIRA)
IndexWriter does not do the right thing when a Thread is interrupt()'d
--

 Key: LUCENE-1573
 URL: https://issues.apache.org/jira/browse/LUCENE-1573
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4.1, 2.4, 2.3.2, 2.3.1, 2.3, 2.2, 2.1, 2.0.0, 1.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9


Spinoff from here:


http://www.nabble.com/Deadlock-with-concurrent-merges-and-IndexWriter--Lucene-2.4--to22714290.html

When a Thread is interrupt()'d while inside Lucene, there is a risk currently 
that it will cause a spinloop and starve BG merges from completing.

Instead, when possible, we should allow interruption.  But unfortunately for 
back-compat, we will need to wrap the exception in an unchecked version.  In 
3.0 we can change that to simply throw InterruptedException.
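
A minimal sketch of the wrapping idea, assuming a hypothetical unchecked exception type (UncheckedInterruptedException is an invented name, and waitForMerges is an illustrative helper, not an IndexWriter method). Restoring the thread's interrupt flag before rethrowing is a common Java convention and is an assumption here, not something the issue text specifies.

```java
// Hypothetical unchecked wrapper; the name is an assumption for illustration.
final class UncheckedInterruptedException extends RuntimeException {
    UncheckedInterruptedException(InterruptedException cause) {
        super(cause);
    }
}

final class InterruptSketch {
    // Waits on the lock, but lets an interrupt end the wait instead of
    // swallowing it and looping again.
    static void waitForMerges(Object lock, long millis) {
        synchronized (lock) {
            try {
                lock.wait(millis);
            } catch (InterruptedException ie) {
                // Restore the flag so callers further up can still observe it.
                Thread.currentThread().interrupt();
                // Existing method signatures cannot add a checked exception
                // without breaking callers, hence the unchecked wrapper.
                throw new UncheckedInterruptedException(ie);
            }
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt();
        try {
            waitForMerges(new Object(), 1000L);
        } catch (UncheckedInterruptedException e) {
            System.out.println("interrupted: " + (e.getCause() instanceof InterruptedException));
        }
    }
}
```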




Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Michael McCandless
Earwin Burrfoot ear...@gmail.com wrote:

 I'd say it is a bad name. Raw hit is way far from being result of a search.

First off, from Lucene's standpoint, the docID *is* the result of the
search.  Your application will do further things (load titles, do
highlighting, etc.) with that result.

Second off, since ResultCollector is an abstract base class, it would
be subclassed to concrete versions that do more interesting things
(call Scorer.score(), etc) so as to make up what your application
considers the result.

So I understand your objection, but I still feel ResultCollector is an OK name.

Or do you have an alternative?

Mike




Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Shai Erera

 I may have lost the context here... but here's what I thought we were
 talking about.

 If we choose the interface option (adding a ProvidesTopDocsResults
 (say) interface), then you would create a method
 renderResults(ProvidesTopDocsResults).

 Then, any collector implementing that interface could be safely passed
 in, as the upcast is done at compile time not runtime.


Consider this code snippet:

HitCollector hc = condition ? new TopDocCollector() : new TopFieldDocCollector();
searcher.search(hc);

The problem is that I need a base class for both collectors. If I use the
interface ProvidesTopDocsResults, then I cannot pass it to searcher w/o
casting to HitCollector. If I use HitCollector, then I need to cast it
before passing it into renderResults(). Only when both classes have the same
base class, which is also a HitCollector, do I not need any casting. I.e.,
after I submit a patch that develops what we've agreed on, the code can look
like this:

TopResultsCollector trc = condition ? new TopScoreDocCollector() : new
TopFieldCollector();
searcher.search(trc);
renderResults(trc);

Here I can pass 'trc' to both methods since it is both a HitCollector and a
TopResultsCollector. That's what I was missing in your proposal.

Shai do you want to take a first cut at making a patch?  Can you open
 an issue?  Thanks.


I can certainly do that. I think the summary of the steps makes sense. I'll
check if TopScoreDocCollector and TopFieldCollector can also extend that
"you provide your own pqueue and we collect top N according to it"
collector, passing a null PQ and extending topDocs().

I also would like to propose an additional method to topDocs(), topDocs(int
start, int howMany) which will be more efficient to call in case of paging
through results. The reason is that topDocs() pops everything from the PQ,
then allocates a ScoreDoc[] array of size of number of results and returns a
TopDocs object. You can then choose just the ones you need.
On the other hand, topDocs(start, howMany) would allocate exactly the size
of array needed. E.g., in case someone pages through results 91-100, you
allocate an array of size 10, instead of 100.
It is not a huge improvement, but it does save some allocations, and
it's a convenient method.
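
The paging idea can be sketched as follows, assuming a plain java.util.PriorityQueue of scores ordered lowest-first stands in for Lucene's own queue; PagingSketch and topScores are hypothetical names. Only an array of the page size is allocated; hits ranked below the page are popped and discarded first, then the page is filled back to front. Like topDocs(), this sketch consumes the queue.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Hypothetical helper; a min-heap of scores stands in for the collector's queue.
final class PagingSketch {
    // Returns the scores ranked start .. start+howMany-1 (0-based, best first).
    // The heap's head is the lowest score, so hits pop out worst first.
    static float[] topScores(PriorityQueue<Float> minHeap, int start, int howMany) {
        int size = minHeap.size();
        if (start < 0 || start >= size || howMany <= 0) {
            return new float[0];
        }
        int count = Math.min(howMany, size - start);
        float[] page = new float[count]; // exactly the page size, not the queue size

        // Discard the hits that rank below the requested page.
        for (int i = size - start - count; i > 0; i--) {
            minHeap.poll();
        }
        // Fill the page from the back, since the worst remaining hit pops first.
        for (int i = count - 1; i >= 0; i--) {
            page[i] = minHeap.poll();
        }
        return page;
    }

    public static void main(String[] args) {
        PriorityQueue<Float> heap = new PriorityQueue<>();
        for (int i = 1; i <= 100; i++) {
            heap.add((float) i);
        }
        // Results 91-100 in 1-based terms: start = 90, howMany = 10.
        System.out.println(Arrays.toString(topScores(heap, 90, 10)));
    }
}
```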

BTW, I like the name ResultsCollector, as it's just like HitCollector, but
does not commit too much to hits .. i.e., facets aren't hits ... I think?

Shai


Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Shai Erera
BTW Mike, I noticed that TopFieldDocCollector extends TopScoreDocCollector.
That is a problem if we want to make TSDC final. Now, TFDC is marked
deprecated, so it will be removed in the future.
I think an easy fix is just to have TFDC extend TopResultsCollector, right?


Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Earwin Burrfoot
 BTW, I like the name ResultsCollector, as it's just like HitCollector, but 
 does not commit too much to hits .. i.e., facets aren't hits ... I think?
What this class consumes and what it produces is a totally different
thing. HitCollector always collects 'hits', and then produces whatever
implementor needs.
For example mine collects hits, then collapses 1..N sequential hits
into a 'metahit', calculates facets, sorts, takes top and loads some
fields. And another one simply counts the hits without doing anything
else. But oh, my, I'm not implementing anything like void
collect(Facet f) method.

It's common sense to name consumer interfaces after what they consume,
not what their implementations might do.

 Or do you have an alternative?
HitCollector is absolutely cool with me. Okay, maybe DocCollector, or
DocIdCollector. Since that is exactly what 'all' of its
implementations do.




Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Marvin Humphrey
On Thu, Mar 26, 2009 at 08:44:57AM -0400, Michael McCandless wrote:

 do you have an alternative?

Brainstorming

  * Harvester
  * Trawler
  * HitPicker
  * HitGrabber

Marvin Humphrey





Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Earwin Burrfoot
 On Thu, Mar 26, 2009 at 08:44:57AM -0400, Michael McCandless wrote:

 do you have an alternative?

 Brainstorming

  * Harvester
  * Trawler
  * HitPicker
  * HitGrabber

 Marvin Humphrey

NitPicker - that absolutely made my day





Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Shai Erera
I still think that ResultsCollector does what you describe. It simply
collects results, while the word 'result' is quite *open* and does not
commit to anything ...

How about dropping the word Collector, since it might not collect anything,
and just save the highest score, or compute some facets ..

What about something with a *Listener like: DocIdListener, SearchListener,
MatchListener (it listens on search matches).





Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Michael McCandless
Shai Erera ser...@gmail.com wrote:
 I may have lost the context here... but here's what I thought we were
 talking about.

 If we choose the interface option (adding a ProvidesTopDocsResults
 (say) interface), then you would create a method
 renderResults(ProvidesTopDocsResults).

 Then, any collector implementing that interface could be safely passed
 in, as the upcast is done at compile time not runtime.

 Consider this code snippet:

 HitCollector hc = condition ? new TopDocCollector() : new TopFieldDocCollector();
 searcher.search(hc);

 The problem is that I need a base class for both collectors. If I use the
 interface ProvidesTopDocsResults, then I cannot pass it to searcher w/o
 casting to HitCollector. If I use HitCollector, then I need to cast it
 before passing it into renderResults(). Only when both classes have the same
 base class, which is also a HitCollector, do I not need any casting. I.e.,
 after I submit a patch that develops what we've agreed on, the code can look
 like this:

 TopResultsCollector trc = condition ? new TopScoreDocCollector() : new
 TopFieldCollector();
 searcher.search(trc);
 renderResults(trc);

 Here I can pass 'trc' to both methods since it is both a HitCollector and a
 TopResultsCollector. That's what I was missing in your proposal.

OK I agree.  And with our proposed changes (TopResultsCollector), you
can do this.

 Shai do you want to take a first cut at making a patch?  Can you open
 an issue?  Thanks.

 I can certainly do that. I think the summary of the steps makes sense. I'll
 check if TopScoreDocCollector and TopFieldCollector can also extend that
 "you provide your own pqueue and we collect top N according to it"
 collector, passing a null PQ and extending topDocs().

OK, thanks.

 I also would like to propose an additional method to topDocs(), topDocs(int
 start, int howMany) which will be more efficient to call in case of paging
 through results. The reason is that topDocs() pops everything from the PQ,
 then allocates a ScoreDoc[] array of size of number of results and returns a
 TopDocs object. You can then choose just the ones you need.
 On the other hand, topDocs(start, howMany) would allocate exactly the size
 of array needed. E.g., in case someone pages through results 91-100, you
 allocate an array of size 10, instead of 100.
 It is not a huge improvement, but it does save some allocations, as well as
 it's a convenient method.

Though... this is somewhat tricky to implement efficiently when using
pqueue: you pop the worst scoring hit first, then next worst scoring,
etc, into an array (in reverse order).  It would be conceivable to do
a [separate] partial sort of the queue to more efficiently retrieve a
top subset of N to save some time on the extraction.  But my guess is
extraction time is trivial; I don't think we need to optimize it.

That being said, we could make the API like this, but under the hood
simply do what we do today the first time it's called, leaving as a
future optimization to speed it up.
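
One way to read that suggestion, sketched under assumptions: keep the paging-style method, but on the first call drain the whole queue into a best-first array exactly as a full extraction would, then serve every page as a slice of the cached array. CachedTopScoresSketch is a hypothetical name, and a java.util.PriorityQueue of scores stands in for the real queue.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Hypothetical sketch: paging API on top, full extraction underneath.
final class CachedTopScoresSketch {
    private final PriorityQueue<Float> minHeap; // head is the lowest score
    private float[] sorted;                     // best first, filled on first call

    CachedTopScoresSketch(PriorityQueue<Float> minHeap) {
        this.minHeap = minHeap;
    }

    float[] topScores(int start, int howMany) {
        if (sorted == null) {
            // First call: pop worst-first and fill the array in reverse order.
            sorted = new float[minHeap.size()];
            for (int i = sorted.length - 1; i >= 0; i--) {
                sorted[i] = minHeap.poll();
            }
        }
        if (start < 0 || start >= sorted.length || howMany <= 0) {
            return new float[0];
        }
        int count = Math.min(howMany, sorted.length - start);
        return Arrays.copyOfRange(sorted, start, start + count);
    }

    public static void main(String[] args) {
        PriorityQueue<Float> heap = new PriorityQueue<>();
        for (int i = 1; i <= 5; i++) {
            heap.add((float) i);
        }
        CachedTopScoresSketch top = new CachedTopScoresSketch(heap);
        System.out.println(Arrays.toString(top.topScores(0, 2)));
        System.out.println(Arrays.toString(top.topScores(3, 10)));
    }
}
```

Unlike a one-shot extraction, repeated calls work here because the queue is drained only once; the cost is holding the full array.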

Alternatively we could make a ScoreDoc result(int n) to retrieve each
result one by one... or maybe doc(int n) and score(int n) since some
collectors won't score (but, then we'd need to handle FieldDoc, which
is used to more generically return the sort field values for each
result).

 BTW, I like the name ResultsCollector, as it's just like HitCollector, but
 does not commit too much to hits .. i.e., facets aren't hits ... I think?

I like it too!

Mike




Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Michael McCandless
Shai Erera ser...@gmail.com wrote:
 BTW Mike, I noticed that TopFieldDocCollector extends TopScoreDocCollector.

Weird.  Probably we could put that back to extending [deprecated]
TopDocCollector?

 That is a problem if we want to make TSDC final. Now, TFDC is marked
 deprecated, so it will be removed in the future.

Right.

 I think an easy fix is just to have TFDC extend TopResultsCollector, right?

Or, back to the way it was pre-1483 (extend TopDocCollector).

Mike




Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Michael McCandless
In 1483 Doug had also suggested:

  * Hitable

I suppose Collector shouldn't really be in the name, since the class
may not actually collect the results (eg if it simply counts).

Mike




Re: List Moderators

2009-03-26 Thread Chris Hostetter

: Every now and again, someone emails me off list asking to be removed from the
: list and I always forward them to Erik, b/c I know he is a moderator.
: However, I was wondering who else is besides Erik, since, AIUI, there needs to
: be at least 3 in ASF-land, right?
: 
: So, if you're a list moderator for dev/user, please stand up.

the docs for committers have instructions for checking the moderators 
for any list, however the process seems to no longer work (probably 
because mail handling got moved onto a different box)...

http://www.apache.org/dev/committers.html#mailing-list-moderators
https://svn.apache.org/repos/private/committers/docs/resources.txt

...might be worth following up with INFRA to sanity check the list of 
moderators on all lucene lists, make sure we have three *active* 
moderators on each list.


-Hoss





Re: List Moderators

2009-03-26 Thread patrick o'leary
Is it also worthwhile to check whether a static signature with instructions,
or a link to the Apache mail instructions, can be added to mails?
It would cut down on a lot of repeat questions.



On Thu, Mar 26, 2009 at 2:46 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : Every now and again, someone emails me off list asking to be removed from
 the
 : list and I always forward them to Erik, b/c I know he is a moderator.
 : However, I was wondering who else is besides Erik, since, AIUI, there
 needs to
 : be at least 3 in ASF-land, right?
 :
 : So, if you're a list moderator for dev/user, please stand up.

 the docs for say committers have instructions for checking the moderators
 for any list, however the process seems to no longer work (probably
 because mail handling got moved onto a different box)...

 http://www.apache.org/dev/committers.html#mailing-list-moderators
 https://svn.apache.org/repos/private/committers/docs/resources.txt

 ...might be worth following up with INFRA to sanity check the list of
 moderators on all lucene lists, make sure we have three *active*
 moderators on each list.


 -Hoss






[jira] Updated: (LUCENE-1573) IndexWriter does not do the right thing when a Thread is interrupt()'d

2009-03-26 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1573:
---

Attachment: LUCENE-1573.patch

Attached patch.  All tests pass, including a new one that showed the deadlock.

I also found and fixed a case where IndexWriter would hang during close (thinking 
a BG merge was still running when it wasn't) if the InterruptedException 
arrived at the right time.

I'll commit in a day or two.

 IndexWriter does not do the right thing when a Thread is interrupt()'d
 --

 Key: LUCENE-1573
 URL: https://issues.apache.org/jira/browse/LUCENE-1573
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1573.patch


 Spinoff from here:
 
 http://www.nabble.com/Deadlock-with-concurrent-merges-and-IndexWriter--Lucene-2.4--to22714290.html
 When a Thread is interrupt()'d while inside Lucene, there is a risk currently 
 that it will cause a spinloop and starve BG merges from completing.
 Instead, when possible, we should allow interruption.  But unfortunately for 
 back-compat, we will need to wrap the exception in an unchecked version.  In 
 3.0 we can change that to simply throw InterruptedException.
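
The back-compat approach described above (wrapping the checked exception in an unchecked one) can be sketched roughly as follows. This is a minimal illustration, not the attached patch: the names WrappedInterruptException and InterruptSketch are hypothetical, and restoring the interrupt flag before rethrowing is an assumption of this sketch.

```java
// Hypothetical sketch; class and method names are illustrative, not Lucene's API.
class WrappedInterruptException extends RuntimeException {
    WrappedInterruptException(InterruptedException cause) {
        super(cause);
    }
}

class InterruptSketch {
    /** Waits on the monitor, surfacing interruption as an unchecked exception. */
    static void doWait(Object monitor, long millis) {
        synchronized (monitor) {
            try {
                monitor.wait(millis);
            } catch (InterruptedException ie) {
                // Assumption of this sketch: restore the flag so callers can still see it.
                Thread.currentThread().interrupt();
                // Wrap rather than swallow: swallowing and retrying is what can spin.
                throw new WrappedInterruptException(ie);
            }
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt(); // simulate an interrupt arriving
        try {
            doWait(new Object(), 1000);
        } catch (WrappedInterruptException e) {
            // prints "interrupted: true"
            System.out.println("interrupted: " + (e.getCause() instanceof InterruptedException));
        }
    }
}
```

Because the wrapper is unchecked, existing method signatures do not need a new throws clause, which is the back-compat constraint the description mentions.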

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2009-03-26 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689624#action_12689624
 ] 

Uwe Schindler commented on LUCENE-831:
--

I will attach my comments regarding the problem with the TrieRangeFilter and 
sorting: stop collecting terms into the cache when lower precisions begin, or 
only collect terms within a specific range (like a range filter). You could fill 
a FieldCache by specifying a starting term and an ending term; all terms in 
between would be put into the cache, and terms outside would be left out. In this 
way, it would be possible to just use TrieUtils.prefixCodeLong() to specify the 
upper and lower integer bounds encoded in the highest precision.
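
The range-bounded fill suggested above might look something like this sketch. It is an illustration under assumptions: a sorted map stands in for the term enumeration, the term encoding (leading "A" for highest precision, "B" for a lower precision) is made up for the example, and RangeLimitedCacheSketch is a hypothetical name rather than anything in the FieldCache patches.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class RangeLimitedCacheSketch {
    /**
     * Fills a per-document value array from a sorted term dictionary, visiting only
     * terms in [lowerTerm, upperTerm]. The map stands in for a term enumeration:
     * key = encoded term, value = ids of the documents containing that term.
     */
    static String[] fill(NavigableMap<String, int[]> terms, int maxDoc,
                         String lowerTerm, String upperTerm) {
        String[] cache = new String[maxDoc];
        for (Map.Entry<String, int[]> e
                : terms.subMap(lowerTerm, true, upperTerm, true).entrySet()) {
            for (int doc : e.getValue()) {
                cache[doc] = e.getKey();
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        NavigableMap<String, int[]> terms = new TreeMap<>();
        terms.put("A001", new int[] {0});        // highest-precision terms
        terms.put("A002", new int[] {1, 2});
        terms.put("B00", new int[] {0, 1, 2});   // lower-precision term, outside the range
        // prints "[A001, A002, A002]"
        System.out.println(Arrays.toString(fill(terms, 3, "A000", "A999")));
    }
}
```

Terms outside the bounds are never visited, so the lower-precision entries cannot overwrite the values used for sorting.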

 Complete overhaul of FieldCache API/Implementation
 --

 Key: LUCENE-831
 URL: https://issues.apache.org/jira/browse/LUCENE-831
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
 Fix For: 3.0

 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, 
 fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
 LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, 
 LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch


 Motivation:
 1) Completely overhaul the API/implementation of FieldCache-type things...
 a) eliminate the global static map keyed on IndexReader (thus
 eliminating the synch block between completely independent IndexReaders)
 b) allow more customization of cache management (ie: use 
 expiration/replacement strategies, disk backed caches, etc)
 c) allow people to define custom cache data logic (ie: custom
 parsers, complex datatypes, etc... anything tied to a reader)
 d) allow people to inspect what's in a cache (list of CacheKeys) for
 an IndexReader so a new IndexReader can be likewise warmed. 
 e) lend support for smarter cache management if/when
 IndexReader.reopen is added (merging of cached data from subReaders).
 2) Provide backwards compatibility to support the existing FieldCache API with
 the new implementation, so there is no redundant caching as client code
 migrates to the new API.




[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2009-03-26 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689629#action_12689629
 ] 

Tim Smith commented on LUCENE-831:
--

One requirement I would like to request is the ability to attach an arbitrary 
object to each segment.
This would allow people using Lucene to store any per-segment caches 
and statistics that their application requires (fully free form).

I would like to see the following:
* add SegmentReader.setCustomCacheManager(CacheManager m) // maybe add a string 
for a CacheManager id (to allow registration of multiple cache managers)
* add SegmentReader.getCustomCacheManager() // to allow accessing the manager

CacheManager should be a very light interface (just a close() method that is 
called when the SegmentReader is closed).
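
A minimal sketch of the requested hook, assuming the string-id variant. SegmentReaderSketch is a hypothetical stand-in for SegmentReader, and none of these methods exist in Lucene as of this thread; only the close() contract is taken from the comment above.

```java
import java.util.HashMap;
import java.util.Map;

// The light interface proposed above: a single close() callback.
interface CacheManager {
    void close();
}

// Hypothetical stand-in for SegmentReader, showing only the proposed hook.
class SegmentReaderSketch {
    private final Map<String, CacheManager> managers = new HashMap<>();

    /** Registers a manager under an id so several managers can coexist. */
    void setCustomCacheManager(String id, CacheManager m) {
        managers.put(id, m);
    }

    CacheManager getCustomCacheManager(String id) {
        return managers.get(id);
    }

    /** Closing the reader notifies every attached manager, then drops them. */
    void close() {
        for (CacheManager m : managers.values()) {
            m.close();
        }
        managers.clear();
    }

    public static void main(String[] args) {
        SegmentReaderSketch reader = new SegmentReaderSketch();
        reader.setCustomCacheManager("stats", () -> System.out.println("stats cache released"));
        reader.close(); // prints "stats cache released"
    }
}
```

Tying the callback to the reader's close() means per-segment data lives exactly as long as the segment reader it describes.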



 Complete overhaul of FieldCache API/Implementation
 --

 Key: LUCENE-831
 URL: https://issues.apache.org/jira/browse/LUCENE-831
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
 Fix For: 3.0

 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, 
 fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
 LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, 
 LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch






[jira] Updated: (LUCENE-1567) New flexible query parser

2009-03-26 Thread Luis Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Alves updated LUCENE-1567:
---

Attachment: lucene_trunk_FlexQueryParser_2009March26.patch

Here is an updated version of the patch with minor fixes. This version does not 
delete the old lucene queryparser.

build.xml default, javadocs, test-core all run fine.

 New flexible query parser
 -

 Key: LUCENE-1567
 URL: https://issues.apache.org/jira/browse/LUCENE-1567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
 Environment: N/A
Reporter: Luis Alves
Assignee: Michael Busch
 Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, 
 lucene_trunk_FlexQueryParser_2009March26.patch


 From the "New flexible query parser" thread by Michael Busch:
 In my team at IBM we have used a different query parser than Lucene's in
 our products for quite a while. Recently we spent a significant amount
 of time in refactoring the code and designing a very generic
 architecture, so that this query parser can be easily used for different
 products with varying query syntaxes.
 This work was originally driven by Andreas Neumann (who, however, left
 our team); most of the code was written by Luis Alves, who has been a
 bit active in Lucene in the past, and Adriano Campos, who joined our
 team at IBM half a year ago. Adriano is an Apache committer and PMC member
 on the Tuscany project and is getting familiar with Lucene now too.
 We think this code is much more flexible and extensible than the current
 Lucene query parser, and would therefore like to contribute it to
 Lucene. I'd like to give a very brief architecture overview here,
 Adriano and Luis can then answer more detailed questions as they're much
 more familiar with the code than I am.
 The goal was to separate the syntax and semantics of a query. E.g. 'a AND
 b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query.
 We distinguish the semantics of the different query components, e.g.
 whether and how to tokenize/lemmatize/normalize the different terms or
 which Query objects to create for the terms. We wanted to be able to
 write a parser with a new syntax, while reusing the underlying
 semantics, as quickly as possible.
 In fact, Adriano is currently working on a 100% Lucene-syntax compatible
 implementation to make it easy for people who are using Lucene's query
 parser to switch.
 The query parser has three layers and its core is what we call the
 QueryNodeTree. It is a tree that initially represents the syntax of the
 original query, e.g. for 'a AND b':
     AND
    /   \
   A     B
 The three layers are:
 1. QueryParser
 2. QueryNodeProcessor
 3. QueryBuilder
 1. The upper layer is the parsing layer which simply transforms the
 query text string into a QueryNodeTree. Currently our implementations of
 this layer use javacc.
 2. The query node processors do most of the work. This layer is in fact a
 configurable chain of processors. Each processor can walk the tree and
 modify nodes or even the tree's structure. That makes it possible to
 e.g. do query optimization before the query is executed or to tokenize
 terms.
 3. The third layer is also a configurable chain of builders, which
 transform the QueryNodeTree into Lucene Query objects.
 Furthermore the query parser uses flexible configuration objects, which
 are based on AttributeSource/Attribute. It also uses message classes that
 allow resource bundles to be attached. This makes it possible to translate
 messages, which is an important feature of a query parser.
 This design allows us to develop different query syntaxes very quickly.
 Adriano wrote the Lucene-compatible syntax in a matter of hours, and the
 underlying processors and builders in a few days. We now have a 100%
 compatible Lucene query parser, which means the syntax is identical and
 all query parser test cases pass on the new one too using a wrapper.
 Recent posts show that there is demand for query syntax improvements,
 e.g. improved range query syntax or operator precedence. There are
 already different QP implementations in Lucene+contrib, however I think
 we did not keep them all up to date and in sync. This is not too
 surprising, because usually when fixes and changes are made to the main
 query parser, people don't make the corresponding changes in the contrib
 parsers. (I'm guilty here too)
 With this new architecture it will be much easier to maintain different
 query syntaxes, as the first layer involves relatively little code.
 All syntaxes would benefit from patches and improvements we make to the
 underlying layers, which will make supporting different syntaxes much
 more manageable.
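
The three-layer design described above can be sketched as follows. This is a toy illustration, not the contributed code: every class here is hypothetical (the real QueryNodeTree classes are not shown in this thread), the builder emits a String instead of Lucene Query objects, and a single builder stands in for the configurable chain of builders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical node types for the tree shared by all three layers.
abstract class QueryNode {}

class TermNode extends QueryNode {
    final String text;
    TermNode(String text) { this.text = text; }
}

class AndNode extends QueryNode {
    final List<QueryNode> children;
    AndNode(List<QueryNode> children) { this.children = children; }
}

// Layer 1: syntax only. Turns query text into a node tree.
interface SyntaxParser { QueryNode parse(String query); }

// Layer 2: a processor may rewrite nodes or restructure the tree.
interface QueryNodeProcessor { QueryNode process(QueryNode node); }

// Layer 3: turns the final tree into a query object (a String here, for brevity).
interface QueryBuilder { String build(QueryNode node); }

class InfixAndParser implements SyntaxParser { // parses 'a AND b'
    public QueryNode parse(String query) {
        List<QueryNode> terms = new ArrayList<>();
        for (String part : query.trim().split("\\s+AND\\s+")) terms.add(new TermNode(part));
        return terms.size() == 1 ? terms.get(0) : new AndNode(terms);
    }
}

class PlusPrefixParser implements SyntaxParser { // parses '+a +b'
    public QueryNode parse(String query) {
        List<QueryNode> terms = new ArrayList<>();
        for (String part : query.trim().split("\\s+")) {
            terms.add(new TermNode(part.startsWith("+") ? part.substring(1) : part));
        }
        return terms.size() == 1 ? terms.get(0) : new AndNode(terms);
    }
}

class LowercaseProcessor implements QueryNodeProcessor { // a stand-in for term normalization
    public QueryNode process(QueryNode node) {
        if (node instanceof TermNode) {
            return new TermNode(((TermNode) node).text.toLowerCase(Locale.ROOT));
        }
        List<QueryNode> out = new ArrayList<>();
        for (QueryNode child : ((AndNode) node).children) out.add(process(child));
        return new AndNode(out);
    }
}

class RequiredTermsBuilder implements QueryBuilder {
    public String build(QueryNode node) {
        if (node instanceof TermNode) return "+" + ((TermNode) node).text;
        StringBuilder sb = new StringBuilder();
        for (QueryNode child : ((AndNode) node).children) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(build(child));
        }
        return sb.toString();
    }
}

class FlexParserSketch {
    static String run(SyntaxParser parser, List<QueryNodeProcessor> chain,
                      QueryBuilder builder, String query) {
        QueryNode tree = parser.parse(query);
        for (QueryNodeProcessor p : chain) tree = p.process(tree);
        return builder.build(tree);
    }

    public static void main(String[] args) {
        List<QueryNodeProcessor> chain = new ArrayList<>();
        chain.add(new LowercaseProcessor());
        QueryBuilder builder = new RequiredTermsBuilder();
        // Two syntaxes, one set of processors and builders; both lines print "+a +b".
        System.out.println(run(new InfixAndParser(), chain, builder, "A AND B"));
        System.out.println(run(new PlusPrefixParser(), chain, builder, "+A +B"));
    }
}
```

Only the first layer differs between the two syntaxes, which is the maintenance benefit the description argues for.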




Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Marvin Humphrey
On Thu, Mar 26, 2009 at 04:01:26PM +0200, Shai Erera wrote:
 I still think that ResultsCollector does what you describe. It simply
 collects results, while the word 'result' is quite *open* and does not
 commit to anything ...

Another advantage of ResultsCollector is that the name suggests good
self-documenting subclass names and variable names.  For instance, it's
reasonably clear what a BitSetCollector or a TopDocsCollector might do.
And when there's only one var around, the name collector is an obvious
choice no matter what the class.  This is all possible because there's no
other use of Collector within Lucene.

I just think ResultsCollector is less euphonic and zippy than
HitCollector, so it's worth exploring alternatives.  

 How about dropping the word Collector, since it might not collect anything,
 and just save the highest score, or compute some facets ..

Sure, wiping the slate clean and re-examining HitCollector's actual purpose
and usage to discover new names is a good approach.  Similar thinking went
into Hitable and Harvester.

FWIW, I'd have to disagree that HitCollector doesn't collect anything.  It may
not collect hits per se, but it's definitely iterating over hits (in the
sense of successful matches) and with only rare exceptions, it's collecting
*something*.

 What about something with a *Listener like: DocIdListener, SearchListener,
 MatchListener (it listens on search matches).

Considering how we attach HITCOLLECTORTHINGY onto the matching process is a
novel take and clarifying to see.  However, maybe it's just me, but *Listener
evokes the JavaScript EventListener stuff, which seems radically different.
Also, if I saw a listener variable in scoring loop code, or a
TopDocsListener module in the JavaDocs, it wouldn't spring out to me that it
would be doing what a HitCollector does right now.

Cheers,

Marvin Humphrey





[jira] Commented: (LUCENE-1567) New flexible query parser

2009-03-26 Thread Luis Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689680#action_12689680
 ] 

Luis Alves commented on LUCENE-1567:


Hi Grant and Michael,
I faxed the CLA today.

 New flexible query parser
 -

 Key: LUCENE-1567
 URL: https://issues.apache.org/jira/browse/LUCENE-1567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
 Environment: N/A
Reporter: Luis Alves
Assignee: Michael Busch
 Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, 
 lucene_trunk_FlexQueryParser_2009March26.patch




Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Michael McCandless
On Thu, Mar 26, 2009 at 5:28 PM, Marvin Humphrey mar...@rectangular.com wrote:
 On Thu, Mar 26, 2009 at 04:01:26PM +0200, Shai Erera wrote:
 I still think that ResultsCollector does what you describe. It simply
 collects results, while the word 'result' is quite *open* and does not
 commit to anything ...

 Another advantage of ResultsCollector is that the name suggests good
 self-documenting subclass names and variable names.  For instance, it's
 reasonably clear what a BitSetCollector or a TopDocsCollector might do.
 And when there's only one var around, the name collector is an obvious
 choice no matter what the class.  This is all possible because there's no
 other use of Collector within Lucene.

I think ResultsCollector (or maybe ResultCollector) is my favorite so far...

But how about simply Collector?  (I realize it's very generic... but
we don't collect anything else in Lucene?).

 What about something with a *Listener like: DocIdListener, SearchListener,
 MatchListener (it listens on search matches).

 Considering how we attach HITCOLLECTORTHINGY onto the matching process is a
 novel take and clarifying to see.  However, maybe it's just me, but *Listener
 evokes the JavaScript EventListener stuff, which seems radically different.
 Also, if I saw a listener variable in scoring loop code, or a
 TopDocsListener module in the JavaDocs, it wouldn't spring out to me that it
 would be doing what a HitCollector does right now.

Yeah I'm not really a fan of Listener either.

Mike




Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Marvin Humphrey
On Thu, Mar 26, 2009 at 06:03:07PM -0400, Michael McCandless wrote:

 But how about simply Collector?  (I realize it's very generic... but
 we don't collect anything else in Lucene?).

+1

Honorable mention to NitPicker, LOL.

Marvin Humphrey





Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Earwin Burrfoot
 I think ResultsCollector (or maybe ResultCollector) is my favorite so far...

 But how about simply Collector?  (I realize it's very generic... but
 we don't collect anything else in Lucene?).
That's exactly what I'm using in my app - abstract class Collector
extends HitCollector, that serves as a base for all my custom
collectors :)
So, yeah, I like this name.

 Yeah I'm not really a fan of Listener either.
+1

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




[jira] Created: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays

2009-03-26 Thread Jason Rutherglen (JIRA)
PooledSegmentReader, pools SegmentReader underlying byte arrays
---

 Key: LUCENE-1574
 URL: https://issues.apache.org/jira/browse/LUCENE-1574
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9


PooledSegmentReader pools the underlying byte arrays of deleted docs and norms 
for realtime search.  It is designed for use with IndexReader.clone which can 
create many copies of byte arrays, which are of the same length for a given 
segment.  When pooled they can be reused which could save on memory.  
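
The pooling idea above can be sketched as a free list of same-length arrays. This is an illustration only: ByteArrayPoolSketch and its methods are hypothetical names and are not taken from the PooledSegmentReader patch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical per-segment pool: every array for a given segment has the same length.
class ByteArrayPoolSketch {
    private final int length;
    private final Deque<byte[]> free = new ArrayDeque<>();

    ByteArrayPoolSketch(int length) {
        this.length = length;
    }

    /** Returns a copy of src, reusing a pooled array when one is available. */
    synchronized byte[] cloneInto(byte[] src) {
        if (src.length != length) {
            throw new IllegalArgumentException("unexpected array length: " + src.length);
        }
        byte[] dst = free.isEmpty() ? new byte[length] : free.pop();
        System.arraycopy(src, 0, dst, 0, length);
        return dst;
    }

    /** Hands an array back once the reader clone that owned it is closed. */
    synchronized void release(byte[] array) {
        if (array.length == length) {
            free.push(array);
        }
    }

    synchronized int available() {
        return free.size();
    }

    public static void main(String[] args) {
        ByteArrayPoolSketch pool = new ByteArrayPoolSketch(4);
        byte[] first = pool.cloneInto(new byte[] {1, 2, 3, 4});
        pool.release(first);
        byte[] second = pool.cloneInto(new byte[] {5, 6, 7, 8});
        System.out.println("reused: " + (first == second)); // prints "reused: true"
    }
}
```

Whether this beats leaving short-lived arrays to the garbage collector is exactly the benchmarking question raised below.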

Do we want to benchmark the memory usage comparison of PooledSegmentReader vs 
GC?  Many times GC is enough for these smaller objects.




[jira] Updated: (LUCENE-1567) New flexible query parser

2009-03-26 Thread Luis Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Alves updated LUCENE-1567:
---

Attachment: (was: lucene_trunk_FlexQueryParser_2009March26.patch)

 New flexible query parser
 -

 Key: LUCENE-1567
 URL: https://issues.apache.org/jira/browse/LUCENE-1567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
 Environment: N/A
Reporter: Luis Alves
Assignee: Michael Busch
 Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, 
 lucene_trunk_FlexQueryParser_2009March26_v3.patch




[jira] Updated: (LUCENE-1567) New flexible query parser

2009-03-26 Thread Luis Alves (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Alves updated LUCENE-1567:
---

Attachment: lucene_trunk_FlexQueryParser_2009March26_v3.patch

I cleaned up all the javadocs on this one.

 New flexible query parser
 -

 Key: LUCENE-1567
 URL: https://issues.apache.org/jira/browse/LUCENE-1567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
 Environment: N/A
Reporter: Luis Alves
Assignee: Michael Busch
 Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, 
 lucene_trunk_FlexQueryParser_2009March26_v3.patch


 From New flexible query parser thread by Micheal Busch
 in my team at IBM we have used a different query parser than Lucene's in
 our products for quite a while. Recently we spent a significant amount
 of time in refactoring the code and designing a very generic
 architecture, so that this query parser can be easily used for different
 products with varying query syntaxes.
 This work was originally driven by Andreas Neumann (who, however, left
 our team); most of the code was written by Luis Alves, who has been a
 bit active in Lucene in the past, and Adriano Campos, who joined our
 team at IBM half a year ago. Adriano is Apache committer and PMC member
 on the Tuscany project and getting familiar with Lucene now too.
 We think this code is much more flexible and extensible than the current
 Lucene query parser, and would therefore like to contribute it to
 Lucene. I'd like to give a very brief architecture overview here,
 Adriano and Luis can then answer more detailed questions as they're much
 more familiar with the code than I am.
 The goal was it to separate syntax and semantics of a query. E.g. 'a AND
 b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query.
 We distinguish the semantics of the different query components, e.g.
 whether and how to tokenize/lemmatize/normalize the different terms or
 which Query objects to create for the terms. We wanted to be able to
 write a parser with a new syntax, while reusing the underlying
 semantics, as quickly as possible.
 In fact, Adriano is currently working on a 100% Lucene-syntax compatible
 implementation to make it easy for people who are using Lucene's query
 parser to switch.
 The query parser has three layers and its core is what we call the
 QueryNodeTree. It is a tree that initially represents the syntax of the
 original query, e.g. for 'a AND b':
   AND
  /   \
 A B
 The three layers are:
 1. QueryParser
 2. QueryNodeProcessor
 3. QueryBuilder
 1. The upper layer is the parsing layer which simply transforms the
 query text string into a QueryNodeTree. Currently our implementations of
 this layer use javacc.
 2. The query node processors do most of the work. It is in fact a
 configurable chain of processors. Each processors can walk the tree and
 modify nodes or even the tree's structure. That makes it possible to
 e.g. do query optimization before the query is executed or to tokenize
 terms.
 3. The third layer is also a configurable chain of builders, which
 transform the QueryNodeTree into Lucene Query objects.
 Furthermore the query parser uses flexible configuration objects, which
 are based on AttributeSource/Attribute. It also uses message classes that
 allow to attach resource bundles. This makes it possible to translate
 messages, which is an important feature of a query parser.
 This design allows us to develop different query syntaxes very quickly.
 Adriano wrote the Lucene-compatible syntax in a matter of hours, and the
 underlying processors and builders in a few days. We now have a 100%
 compatible Lucene query parser, which means the syntax is identical and
 all query parser test cases pass on the new one too using a wrapper.
 Recent posts show that there is demand for query syntax improvements,
 e.g. improved range query syntax or operator precedence. There are
 already different QP implementations in Lucene+contrib; however, I think
 we did not keep them all up to date and in sync. This is not too
 surprising, because usually when fixes and changes are made to the main
 query parser, people don't make the corresponding changes in the contrib
 parsers. (I'm guilty here too)
 With this new architecture it will be much easier to maintain different
 query syntaxes, as the first layer requires very little code.
 All syntaxes would benefit from patches and improvements we make to the
 underlying layers, which will make supporting different syntaxes much
 more manageable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org

Hudson build is back to normal: Lucene-trunk #778

2009-03-26 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/778/changes



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Shai Erera
I really liked HitCollector and hate to give up the name ... However
Collector is fine with me too, and it is at least more generic than
HitCollector ...

Hitable sounds too aggressive/violent to me :)

BTW, I guess I should create some new searcher API which receives this
Collector class (is Collector the chosen name?) and deprecate those that
accept HitCollector?
The new methods can also skip the instanceof check and the wrapping of HC in MRHC ...

That also means that I should throw away that MRHC wrapper (which rebases doc
IDs)? If HitCollector is deprecated, then there's no need to keep it. But
perhaps we want it there in 2.9 for easier migration? Personally I think
it's redundant since in 3.0 people will need to change all their collectors
anyway (since HitCollector will be removed, and every class which extends
HitCollector will need to be modified). What do you think?
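To make the wrapper question concrete, here is a minimal sketch of what rebasing means. The classes are simplified stand-ins with hypothetical names (OldStyleCollector, PerSegmentCollector, RebasingWrapper), not the real Lucene classes, and the "search" simply emits every doc of every segment with a constant score.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins with hypothetical names; not the real Lucene classes.

/** Old-style collector: expects index-wide doc IDs. */
abstract class OldStyleCollector {
    abstract void collect(int doc, float score);
}

/** New-style collector: is told each segment's doc base and then
 *  receives segment-relative doc IDs. */
abstract class PerSegmentCollector extends OldStyleCollector {
    abstract void setNextReader(int docBase);
}

/** Adapts an old-style collector by rebasing doc IDs before delegating. */
class RebasingWrapper extends PerSegmentCollector {
    private final OldStyleCollector delegate;
    private int docBase;

    RebasingWrapper(OldStyleCollector delegate) { this.delegate = delegate; }

    void setNextReader(int docBase) { this.docBase = docBase; }

    void collect(int doc, float score) {
        delegate.collect(doc + docBase, score); // segment-relative -> index-wide
    }
}

/** Old-style collector that just records the doc IDs it sees. */
class IdRecorder extends OldStyleCollector {
    final List<Integer> ids = new ArrayList<Integer>();
    void collect(int doc, float score) { ids.add(doc); }
}

class SearchSketch {
    /** Legacy entry point: needs the instanceof check and the wrapping. */
    static void searchLegacy(int[] segmentSizes, OldStyleCollector collector) {
        PerSegmentCollector target = (collector instanceof PerSegmentCollector)
                ? (PerSegmentCollector) collector
                : new RebasingWrapper(collector);
        search(segmentSizes, target);
    }

    /** New entry point: no instanceof check and no wrapping. */
    static void search(int[] segmentSizes, PerSegmentCollector collector) {
        int docBase = 0;
        for (int size : segmentSizes) {
            collector.setNextReader(docBase);
            for (int doc = 0; doc < size; doc++) {
                collector.collect(doc, 1.0f); // ID relative to the current segment
            }
            docBase += size;
        }
    }
}
```

Passing an IdRecorder to searchLegacy with segment sizes {2, 3} records the IDs 0 through 4, because the wrapper adds each segment's base. Without the wrapper the second segment would report 0, 1, 2 again, which is the back-compat problem the wrapper exists to hide.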

Also, there's no need to deprecate MRHC, since it's only in the trunk - I
can simply rename it, right?

OK, I'll go ahead and prepare a patch. We can discuss the name more; in the
end it will just be a short refactor action in Eclipse, so that shouldn't
hold us (or me) up.

Shai

On Fri, Mar 27, 2009 at 1:24 AM, Earwin Burrfoot ear...@gmail.com wrote:

  I think ResultsCollector (or maybe ResultCollector) is my favorite so
  far...
 
  But how about simply Collector?  (I realize it's very generic... but
  we don't collect anything else in Lucene?).
 That's exactly what I'm using in my app - abstract class Collector
 extends HitCollector, that serves as a base for all my custom
 collectors :)
 So, yeah, I like this name.

  Yeah I'm not really a fan of Listener either.
 +1

 --
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org