[jira] Commented: (LUCENE-794) SpanScorer and SimpleSpanFragmenter for Contrib Highlighter

2007-04-10 Thread Sean O'Connor (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487981
 ] 

Sean O'Connor commented on LUCENE-794:
--

Thanks Mark. I had the trunk from a few days ago (perhaps a week), so that was 
just me being lazy :-).

Is there anything I should be aware of regarding the parser.setUseOldRangeQuery(true) 
call in doSearching(String queryString)? [around line 890 in 
SpanHighlighterTest.java]

I've read the javadocs, which explain it a bit, but I don't think I understand 
enough to infer why you use it in SpanHighlighterTest.java. If I can 
(relatively) safely ignore that, I will.
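For context, a minimal sketch of the difference that call makes (the field name and 
analyzer below are illustrative, not the test's actual values): with 
setUseOldRangeQuery(true) the parser builds a RangeQuery, whose terms the 
highlighter can pull out, while the default constant score range query exposes no 
terms to extract, which is why Mark notes below that it cannot be highlighted.

    QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
    parser.setUseOldRangeQuery(true);   // parse ranges as RangeQuery, not ConstantScoreRangeQuery
    Query query = parser.parse("[kannedy TO kznnedy]");  // throws ParseException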

Sean


Mark Miller (JIRA) wrote:
[ 
[1]https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487860
 ] 

Mark Miller commented on LUCENE-794:


Sorry Sean, I forgot to mention that the patch is off of the latest 
Lucene trunk code.

The range query test should fail because they switched the query parser 
to return a constant score query instead of a range query; a constant 
score query cannot be highlighted.

- Mark



  
SpanScorer and SimpleSpanFragmenter for Contrib Highlighter
---

Key: LUCENE-794
URL: [2]https://issues.apache.org/jira/browse/LUCENE-794
Project: Lucene - Java
 Issue Type: Improvement
 Components: Other
   Reporter: Mark Miller
   Priority: Minor
Attachments: CachedTokenStream.java, CachedTokenStream.java, 
CachedTokenStream.java, DefaultEncoder.java, Encoder.java, Formatter.java, 
Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java, 
Highlighter.java, HighlighterTest.java, HighlighterTest.java, 
HighlighterTest.java, HighlighterTest.java, MemoryIndex.java, 
QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java, 
QuerySpansExtractor.java, SimpleFormatter.java, spanhighlighter.patch, 
spanhighlighter2.patch, spanhighlighter3.patch, spanhighlighter5.patch, 
spanhighlighter_patch_4.zip, SpanHighlighterTest.java, 
SpanHighlighterTest.java, SpanScorer.java, SpanScorer.java, 
WeightedSpanTerm.java


This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter package 
that scores just like QueryScorer, but scores a 0 for Terms that did not cause 
the Query hit. This gives 'actual' hit highlighting for the range of SpanQuerys 
and PhraseQuery. There is also a new Fragmenter that attempts to fragment 
without breaking up Spans.
See [3]http://issues.apache.org/jira/browse/LUCENE-403 for some background.
There is a dependency on MemoryIndex.


  


[1] 
https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487860
[2] https://issues.apache.org/jira/browse/LUCENE-794
[3] http://issues.apache.org/jira/browse/LUCENE-403


> SpanScorer and SimpleSpanFragmenter for Contrib Highlighter
> ---
>
> Key: LUCENE-794
> URL: https://issues.apache.org/jira/browse/LUCENE-794
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Mark Miller
>Priority: Minor
> Attachments: CachedTokenStream.java, CachedTokenStream.java, 
> CachedTokenStream.java, DefaultEncoder.java, Encoder.java, Formatter.java, 
> Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java, 
> Highlighter.java, HighlighterTest.java, HighlighterTest.java, 
> HighlighterTest.java, HighlighterTest.java, MemoryIndex.java, 
> QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java, 
> QuerySpansExtractor.java, SimpleFormatter.java, spanhighlighter.patch, 
> spanhighlighter2.patch, spanhighlighter3.patch, spanhighlighter5.patch, 
> spanhighlighter_patch_4.zip, SpanHighlighterTest.java, 
> SpanHighlighterTest.java, SpanScorer.java, SpanScorer.java, 
> WeightedSpanTerm.java
>
>
> This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter 
> package that scores just like QueryScorer, but scores a 0 for Terms that did 
> not cause the Query hit. This gives 'actual' hit highlighting for the range 
> of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts 
> to fragment without breaking up Spans.
> See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Resolved: (LUCENE-857) Remove BitSet caching from QueryFilter

2007-04-10 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved LUCENE-857.
-

   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

But of course.  Thanks for the catch!


> Remove BitSet caching from QueryFilter
> --
>
> Key: LUCENE-857
> URL: https://issues.apache.org/jira/browse/LUCENE-857
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Otis Gospodnetic
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: LUCENE-857.patch, LUCENE-857.refactoring-approach.diff
>
>
> Since caching is built into the public BitSet bits(IndexReader reader)  
> method, I don't see a way to deprecate that, which means I'll just cut it out 
> and document it in CHANGES.txt.  Anyone who wants QueryFilter caching will be 
> able to get the caching back by wrapping the QueryFilter in the 
> CachingWrapperFilter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-04-10 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487966
 ] 

Otis Gospodnetic commented on LUCENE-584:
-

Right.  I was under the wrong impression that the Matcher also happens to avoid 
scoring.  However, now that we've all looked at this patch (still applies 
cleanly and unit tests all pass), and nobody had any criticisms, I think we 
should commit it, say this Friday.

As I'm in performance-squeezing mode, I'll go look at LUCENE-730, another 
one of Paul's great patches, and see if I can measure a performance improvement 
there.


> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, 
> Filter-20060628.patch, HitCollector-20060628.patch, 
> IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, 
> Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, 
> Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, 
> TestSortedVIntList.java
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Branding: the TLP, and "Lucene Java"

2007-04-10 Thread Otis Gospodnetic
For some reason I've never been confused by the naming.  I think in my mind and 
when I talk about this, I say "Lucene project" when I mean the TLP, and Lucene 
when I talk about the original Lucene.  Though I'd personally be sad to see 
the original Lucene get renamed now, I'm open. :)

I agree with Grant about where we are going with the Lucene TLP, and I'm very much 
looking forward to new things that will grow under the Lucene name.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Tuesday, April 10, 2007 9:13:36 PM
Subject: Re: Lucene Branding: the TLP, and "Lucene Java"

No, you are not the only one...  Many a sleepless night spent on  
it...  :-)

I usually try to refer to it as Lucene Java, but old habits die hard  
and often times I just call it Lucene.  I think the name has a good  
brand at this point and is very strongly associated w/ the Java  
library.  I seem to recall when they were forming the TLP, that the  
original proposal was search.a.o, but then changed b/c the ASF didn't  
like generic names (or at least that is how I recall it.)  And, of  
course, with Hadoop and the potential for Tika/Lius, it isn't just  
search anymore.  I have often thought about an Apache "Text" project,  
that could eventually hold a whole family of text based tools like  
Lucene, Tika, Hadoop, Solr, etc. plus things like part of speech  
taggers, clustering/classification algorithms, UIMA, etc. all under  
one roof.  But that is just my two cents and I don't know if it fits  
with what other people have in mind.  There are a lot of OSS tools  
out there for these things, but none bring together a whole suite  
under a brand like Apache.

-Grant


On Apr 10, 2007, at 8:41 PM, Chris Hostetter wrote:

>
> I was motivated to start this thread by LUCENE-860, but it's been
> in the back of my mind for a while.
>
> As the Lucene Top Level Project grows and gets more Sub-Projects, I
> (personally) have been finding it hard in email/documentation/discussion
> to clarify when people are referring to the "Lucene" Top Level Project
> versus the "Lucene" java project.  I can't help but wonder if the TLP
> should have a different name, or if "Lucene Java" should take on a more
> specific name that doesn't just sound like a name followed by a language --
> ie: JLucene, LuceneJ ... anything that makes it more clear that when the
> word "Lucene" is used it's talking about the broader Top Level project
> addressing all aspects of OSS Search Software.
>
> Am I the only one that wonders about this as time goes on?
>
> -Hoss
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Branding: the TLP, and "Lucene Java"

2007-04-10 Thread Grant Ingersoll
No, you are not the only one...  Many a sleepless night spent on  
it...  :-)


I usually try to refer to it as Lucene Java, but old habits die hard  
and often times I just call it Lucene.  I think the name has a good  
brand at this point and is very strongly associated w/ the Java  
library.  I seem to recall when they were forming the TLP, that the  
original proposal was search.a.o, but then changed b/c the ASF didn't  
like generic names (or at least that is how I recall it.)  And, of  
course, with Hadoop and the potential for Tika/Lius, it isn't just  
search anymore.  I have often thought about an Apache "Text" project,  
that could eventually hold a whole family of text based tools like  
Lucene, Tika, Hadoop, Solr, etc. plus things like part of speech  
taggers, clustering/classification algorithms, UIMA, etc. all under  
one roof.  But that is just my two cents and I don't know if it fits  
with what other people have in mind.  There are a lot of OSS tools  
out there for these things, but none bring together a whole suite  
under a brand like Apache.


-Grant


On Apr 10, 2007, at 8:41 PM, Chris Hostetter wrote:



I was motivated to start this thread by LUCENE-860, but it's been
in the back of my mind for a while.

As the Lucene Top Level Project grows and gets more Sub-Projects, I
(personally) have been finding it hard in email/documentation/discussion
to clarify when people are referring to the "Lucene" Top Level Project
versus the "Lucene" java project.  I can't help but wonder if the TLP
should have a different name, or if "Lucene Java" should take on a more
specific name that doesn't just sound like a name followed by a language --
ie: JLucene, LuceneJ ... anything that makes it more clear that when the
word "Lucene" is used it's talking about the broader Top Level project
addressing all aspects of OSS Search Software.

Am I the only one that wonders about this as time goes on?

-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Reopened: (LUCENE-857) Remove BitSet caching from QueryFilter

2007-04-10 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man reopened LUCENE-857:
-

Lucene Fields: [New, Patch Available]  (was: [New])

Actually Otis: for the backwards compatibility to work, QueryFilter needs to 
extend CachingWrapperFilter with a constructor like...

    public QueryFilter(Query query) {
      super(new QueryWrapperFilter(query));
    }

...what you've committed eliminates the caching from QueryFilter.
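For completeness, a hedged sketch (not code from the patch) of the caller-side 
workaround the issue description points at; the query, field, and searcher here 
are illustrative stand-ins:

    // illustrative only: restore the old caching behaviour by wrapping explicitly
    Query query = new TermQuery(new Term("status", "published"));   // hypothetical field/term
    Filter filter = new CachingWrapperFilter(new QueryWrapperFilter(query));
    Hits hits = searcher.search(new MatchAllDocsQuery(), filter);   // searcher: an existing IndexSearcher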

> Remove BitSet caching from QueryFilter
> --
>
> Key: LUCENE-857
> URL: https://issues.apache.org/jira/browse/LUCENE-857
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Otis Gospodnetic
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: LUCENE-857.patch, LUCENE-857.refactoring-approach.diff
>
>
> Since caching is built into the public BitSet bits(IndexReader reader)  
> method, I don't see a way to deprecate that, which means I'll just cut it out 
> and document it in CHANGES.txt.  Anyone who wants QueryFilter caching will be 
> able to get the caching back by wrapping the QueryFilter in the 
> CachingWrapperFilter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene Branding: the TLP, and "Lucene Java"

2007-04-10 Thread Chris Hostetter

I was motivated to start this thread by LUCENE-860, but it's been in the
back of my mind for a while.

As the Lucene Top Level Project grows and gets more Sub-Projects, I
(personally) have been finding it hard in email/documentation/discussion
to clarify when people are referring to the "Lucene" Top Level Project
versus the "Lucene" java project.  I can't help but wonder if the TLP
should have a different name, or if "Lucene Java" should take on a more
specific name that doesn't just sound like a name followed by a language --
ie: JLucene, LuceneJ ... anything that makes it more clear that when the
word "Lucene" is used it's talking about the broader Top Level project
addressing all aspects of OSS Search Software.

Am I the only one that wonders about this as time goes on?

-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-10 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487962
 ] 

Hoss Man commented on LUCENE-855:
-

On Mon, 9 Apr 2007, Otis Gospodnetic (JIRA) wrote:

: I'd love to know what Hoss and other big Filter users think about this.
: Solr makes a lot of use of (Range?)Filters, I believe.

This is one of those Jira issues that i didn't really have time to follow when 
it was first opened, and so the Jira emails have just been piling up waiting 
for me to read.

Here are the raw notes i took as i read through the patches...


FieldCacheRangeFilter.patch  from 10/Apr/07 01:52 PM

 * javadoc cut/paste errors (FieldCache)
 * FieldCacheRangeFilter should work with simple strings
   (using FieldCache.getStrings or FieldCache.getStringIndex)
   just like regular RangeFilter
 * it feels like the various parser versions should be in
   separate subclasses (common abstract base class?)
 * why does clone need to construct a raw BitSet?  what exactly didn't
   work about ChainedFilter without this?
   (could cause other BitSet usage problems)
 * or/and/andNot/xor can all be implemented using convertToBitSet
 * need FieldCacheBitSet methods: cardinality, get(int,int)
 * need equals and hashCode methods in all new classes
 * FieldCacheBitSet.clear should be UnsuppOp
 * convertToBitSet can be cached.
 * FieldCacheBitSet should be abstract, requiring get(int) be implemented


MemoryCachedRangeFilter_1.4.patch from 06/Apr/07 06:14 AM

 * "tuples" should be initialized to fieldCache.length ... serious
   ArrayList resizing going on there
   (why is it an ArrayList, why not just Tuples[] ?)
 * doesn't "cache" need synchronization? ... seems like the same
   CreationPlaceholder pattern used in FieldCache might make sense here.
 * this looks wrong...
 } else if ( (!includeLower) && (lowerIndex >= 0) ) {
   ...consider the case where lower==5, includeLower==false, and all values
   in the index are 5; binary search could leave us in the middle of the index,
   so we still need to move forward to the end? (see the sketch after these notes)
 * ditto above concern for finding upperIndex
 * what is pathological worst case for rewind/forward when *lots* of
   duplicate values in index?  should another binarySearch be used?
 * a lot of code in MemoryCachedRangeFilter.bits for finding
   lowerIndex/upperIndex would probably make more sense as methods in
   SortedFieldCache
 * only seems to handle longs, at a minimum should deal with arbitrary
   strings, with optional add ons for longs/ints/etc...
 * I can't help but wonder how MemoryCachedRangeFilter would compare if it
   used Solr's OpenBitSet (facaded to implement the BitSet API)
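As a concrete illustration of the duplicate-value concern flagged above (a sketch 
under assumed names, not the patch's code): Arrays.binarySearch makes no promise 
about which of several equal elements it returns, so after a hit the code still 
has to walk to the first occurrence (or past the last one).

    class LowerBoundSketch {
      static int lowerIndex(long[] sortedValues, long lower, boolean includeLower) {
        int idx = java.util.Arrays.binarySearch(sortedValues, lower);
        if (idx < 0) {
          return -idx - 1;              // insertion point: first value greater than lower
        }
        if (includeLower) {
          while (idx > 0 && sortedValues[idx - 1] == lower) idx--;               // back up to the first occurrence
        } else {
          while (idx < sortedValues.length && sortedValues[idx] == lower) idx++; // skip every equal value
        }
        return idx;
      }
    }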

TestRangeFilterPerformanceComparison.java   from 10/Apr/07

 * I can't help but wonder how RangeFilter would compare if it used Solr's
   OpenBitSet (facaded to implement the BitSet API)
 * no test of includeLower==false or includeUpper==false
 * i don't think the ranges being compared are the same for RangeFilter as they 
   are for the other Filters ... note the use of DateTools when building the 
index, 
   vs straight string usage in RangeFilter, vs Long.parseLong in 
   MemoryCachedRangeFilter and FieldCacheRangeFilter
 * is it really a fair comparison to call MemoryCachedRangeFilter.warmup
   or FieldCacheRangeFilter.bits outside of the timing code?
   for indexes where the IndexReader is reopened periodically this may
   be a significant number to be aware of.


Questions about the legitimacy of the testing aside...

In general, I like the approach of FieldCacheBitSet -- but it should be 
generalized into an "AbstractReadOnlyBitSet" where all methods are implemented 
via get(int) in subclasses -- we should make sure that every method in the 
BitSet API works as advertised in Java 1.4.  
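A minimal sketch of that generalization (illustrative only, not the patch; a real 
version would have to cover the rest of the java.util.BitSet API the same way, 
since any inherited method still reads the empty internal words):

    public abstract class AbstractReadOnlyBitSet extends java.util.BitSet {
      private final int size;

      protected AbstractReadOnlyBitSet(int size) { this.size = size; }

      public abstract boolean get(int index);   // e.g. answered from a FieldCache array

      public int length() { return size; }

      public int cardinality() {
        int count = 0;
        for (int i = 0; i < size; i++) if (get(i)) count++;
        return count;
      }

      public int nextSetBit(int fromIndex) {
        for (int i = fromIndex; i < size; i++) if (get(i)) return i;
        return -1;
      }

      public void set(int bitIndex) { throw new UnsupportedOperationException("read-only"); }
      public void clear(int bitIndex) { throw new UnsupportedOperationException("read-only"); }
    }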

I don't really like the various hoops FieldCacheRangeFilter has to jump through 
to support int/float/long ... I think at its core it should support simple 
Strings, with alternate/sub classes for dealing with other FieldCache formats 
... i just really dislike all the crazy nested ifs to deal with the different 
Parser types, if there's going to be separate constructors for 
longs/floats/ints, they might as well be separate sub-classes.

the really nice thing this has over RangeFilter is that people can index raw 
numeric values without needing to massage them into lexicographically ordered 
Strings (since the FieldCache will take care of parsing them appropriately) 
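A small plain-Java illustration of the ordering problem being sidestepped (not 
from the patch):

    String[] raw = { "10", "9", "2" };
    java.util.Arrays.sort(raw);       // -> [ "10", "2", "9" ]   lexicographic, wrong for ranges
    String[] padded = { "0010", "0009", "0002" };
    java.util.Arrays.sort(padded);    // -> [ "0002", "0009", "0010" ]   what a string RangeFilter needs
    long[] parsed = { 10L, 9L, 2L };
    java.util.Arrays.sort(parsed);    // -> [ 2, 9, 10 ]   what a FieldCache-based filter compares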

My gut tells me that the MemoryCachedRangeFilter approach will never ever be 
able to compete with the FieldCacheRangeFilter facading BitSet approach since 
it needs to build the FieldCache, then the SortedFieldCache, then a BitSet 
...it seems like any optimization in that pipeline can always be beaten by 
using the same logic, but then facading the BitSet.




> MemoryCachedRangeFilter to boost performance of Range queries
> ---

[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-04-10 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487940
 ] 

Hoss Man commented on LUCENE-584:
-

I'm a little behind on following this issue, but if i can attempt to sum up the 
recent discussion about performance...

   "Migrating towards a "Matcher" API *may* allow some types of Queries to be 
faster in situations where clients can use a MatchCollector instead of a 
HitCollector, but this won't be a silver bullet performance win for all Query 
classes -- just those where some of the score calculations is (or can be) 
isolated to the score method (as opposed to skipTO or next)"

I think it's important to remember the motivation of this issue wasn't to 
improve the speed of non-scoring searches, it was to decouple the 
concept of "Filtering" results away from needing to populate a (potentially 
large) BitSet when the logic necessary for Filtering can easily be expressed 
in terms of a doc iterator (aka: a Matcher) -- opening up the possibility of 
memory performance improvements.  

A second benefit that has arisen as the issue evolved has been the 
generalization of the "Matcher" concept into a superclass of Scorer, for 
simpler APIs moving forward.
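As a rough illustration of that relationship (a sketch, not the patch's actual 
classes): a Matcher only iterates over matching doc ids, and Scorer layers 
scoring on top, so purely filtering code never has to pay for score().

    import java.io.IOException;

    abstract class Matcher {
      public abstract boolean next() throws IOException;              // advance to the next matching doc
      public abstract boolean skipTo(int target) throws IOException;  // jump to the first match >= target
      public abstract int doc();                                      // current doc id
    }

    abstract class Scorer extends Matcher {
      public abstract float score() throws IOException;               // only scoring searches need this
    }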




> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, 
> Filter-20060628.patch, HitCollector-20060628.patch, 
> IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, 
> Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, 
> Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, 
> TestSortedVIntList.java
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-730) Restore top level disjunction performance

2007-04-10 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-730:


Lucene Fields: [New, Patch Available]  (was: [New])

> Restore top level disjunction performance
> -
>
> Key: LUCENE-730
> URL: https://issues.apache.org/jira/browse/LUCENE-730
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Paul Elschot
>Priority: Minor
> Attachments: TopLevelDisjunction20061127.patch
>
>
> This patch restores the performance of top level disjunctions. 
> The introduction of BooleanScorer2 had impacted this as reported
> on java-user on 21 Nov 2006 by Stanislav Jordanov.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Failed test: testExpirationTimeDeletionPolicy

2007-04-10 Thread Michael McCandless

"Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:
> Just saw this test fail:
> 
> [junit] Testcase:
> 
> testExpirationTimeDeletionPolicy(org.apache.lucene.index.TestDeletionPolicy):
> FAILED
> [junit] commit point was older than 2.0 seconds but did not get
> deleted
> [junit] junit.framework.AssertionFailedError: commit point was older
> than 2.0 seconds but did not get deleted
> [junit] at
> 
> org.apache.lucene.index.TestDeletionPolicy.testExpirationTimeDeletionPolicy(TestDeletionPolicy.java:229)
> [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> [junit] at
> 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> [junit] at
> 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 
> Is my G4 Powerbook too slow? ;)  It does take 15 minutes to run the
> complete test suite.
> 
> Subsequent runs of just this tests were all successful, but it did fail
> once, as shown above.

Hmmm.  That test verifies that a time based deletion policy (remove a
commit point only if it's older than X seconds) is working properly.
I added it (recently) for LUCENE-710.

OK I think I see where this test is wrongly sensitive to the speed of
the machine it's running on and would then cause a false positive
failure.  I will commit a fix.

Still, Otis, I think you should upgrade to a MacBook Pro :)

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-10 Thread Matt Ericson (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Ericson updated LUCENE-855:


Attachment: FieldCacheRangeFilter.patch

Let's try this again.

I am very sorry to everyone for the last patch. I had some trouble with my 
environment not rebuilding correctly.

I have done an ant clean before testing.
Andy, take a look at this patch and tell me what you think.



> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, 
> TestRangeFilterPerformanceComparison.java, 
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  The performance increase is less 
> dramatic when a field has fewer unique terms or the index has fewer 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values are stored as longs, is that there's no 
> longer the need to make the values lexographically comparable, i.e. padding 
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The 
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step which can take a while to 
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
> can be called explicitly or is automatically called the first time 
> MemoryCachedRangeFilter is applied using a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - Field contains many unique numeric values
> - Index contains a large number of documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-857) Remove BitSet caching from QueryFilter

2007-04-10 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved LUCENE-857.
-

Resolution: Fixed

Thanks for the persistence and patience, Hoss.  I see the light now!  The patch 
wouldn't apply to QueryFilter, so I made changes manually.
Committed.


> Remove BitSet caching from QueryFilter
> --
>
> Key: LUCENE-857
> URL: https://issues.apache.org/jira/browse/LUCENE-857
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Otis Gospodnetic
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: LUCENE-857.patch, LUCENE-857.refactoring-approach.diff
>
>
> Since caching is built into the public BitSet bits(IndexReader reader)  
> method, I don't see a way to deprecate that, which means I'll just cut it out 
> and document it in CHANGES.txt.  Anyone who wants QueryFilter caching will be 
> able to get the caching back by wrapping the QueryFilter in the 
> CachingWrapperFilter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Failed test: testExpirationTimeDeletionPolicy

2007-04-10 Thread Otis Gospodnetic
Just saw this test fail:

[junit] Testcase: 
testExpirationTimeDeletionPolicy(org.apache.lucene.index.TestDeletionPolicy):   
  FAILED
[junit] commit point was older than 2.0 seconds but did not get deleted
[junit] junit.framework.AssertionFailedError: commit point was older than 
2.0 seconds but did not get deleted
[junit] at 
org.apache.lucene.index.TestDeletionPolicy.testExpirationTimeDeletionPolicy(TestDeletionPolicy.java:229)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit] at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

Is my G4 Powerbook too slow? ;)  It does take 15 minutes to run the complete 
test suite.

Subsequent runs of just this tests were all successful, but it did fail once, 
as shown above.

Otis



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-860) site should call project "Lucene Java", not just "Lucene"

2007-04-10 Thread Doug Cutting (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Cutting updated LUCENE-860:


Lucene Fields: [Patch Available]  (was: [New])

> site should call project "Lucene Java", not just "Lucene"
> -
>
> Key: LUCENE-860
> URL: https://issues.apache.org/jira/browse/LUCENE-860
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Website
>Reporter: Doug Cutting
>Priority: Minor
> Attachments: LUCENE-860.patch
>
>
> To avoid confusion with the top-level Lucene project, the Lucene Java website 
> should refer to itself as Lucene Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-860) site should call project "Lucene Java", not just "Lucene"

2007-04-10 Thread Doug Cutting (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Cutting updated LUCENE-860:


Attachment: LUCENE-860.patch

Here's a patch that replaces "Apache Lucene" with "Apache Lucene Java" in the 
website.  It also fixes the breadcrumbs at the top of the web pages and the 
links on the logos.

Is "Apache Lucene Java" too verbose?  Should we instead just use "Lucene Java"?

> site should call project "Lucene Java", not just "Lucene"
> -
>
> Key: LUCENE-860
> URL: https://issues.apache.org/jira/browse/LUCENE-860
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Website
>Reporter: Doug Cutting
>Priority: Minor
> Attachments: LUCENE-860.patch
>
>
> To avoid confusion with the top-level Lucene project, the Lucene Java website 
> should refer to itself as Lucene Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-860) site should call project "Lucene Java", not just "Lucene"

2007-04-10 Thread Doug Cutting (JIRA)
site should call project "Lucene Java", not just "Lucene"
-

 Key: LUCENE-860
 URL: https://issues.apache.org/jira/browse/LUCENE-860
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Doug Cutting
Priority: Minor


To avoid confusion with the top-level Lucene project, the Lucene Java website 
should refer to itself as Lucene Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-10 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487897
 ] 

Andy Liu commented on LUCENE-855:
-

Hey Matt, I get this exception when running your newest FCRF with the 
performance test.  Can you check to see if you get this also?

java.lang.ArrayIndexOutOfBoundsException: 10
at 
org.apache.lucene.search.FieldCacheRangeFilter$5.get(FieldCacheRangeFilter.java:231)
at 
org.apache.lucene.search.IndexSearcher$1.collect(IndexSearcher.java:136)
at org.apache.lucene.search.Scorer.score(Scorer.java:49)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:74)
at org.apache.lucene.search.Hits.<init>(Hits.java:53)
at org.apache.lucene.search.Searcher.search(Searcher.java:46)
at 
org.apache.lucene.misc.TestRangeFilterPerformanceComparison$Benchmark.go(TestRangeFilterPerformanceComparison.java:312)
at 
org.apache.lucene.misc.TestRangeFilterPerformanceComparison.testPerformance(TestRangeFilterPerformanceComparison.java:201)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at junit.framework.TestCase.runTest(TestCase.java:154)
at junit.framework.TestCase.runBare(TestCase.java:127)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:118)
at junit.framework.TestSuite.runTest(TestSuite.java:208)
at junit.framework.TestSuite.run(TestSuite.java:203)
at 
org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:128)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)



> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, MemoryCachedRangeFilter.patch, 
> MemoryCachedRangeFilter_1.4.patch, TestRangeFilterPerformanceComparison.java, 
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  The performance increase is less 
> dramatic when a field has fewer unique terms or the index has fewer 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values are stored as longs, is that there's no 
> longer the need to make the values lexographically comparable, i.e. padding 
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The

Re: Why ORScorer delayed init?

2007-04-10 Thread Chris Hostetter

: I thought it would avoid accessing the index as much as
: possible before actually doing a search, but I did not
: verify whether that is important.
: In case it is not, any simplification is off course welcome.

conceptually: once Query.createWeight(Searcher) is called, the "Search"
has already begun, hasn't it?  ... if not then, it has at the very least by the
time Weight.scorer(IndexReader) is called, i would imagine.


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Why ORScorer delayed init?

2007-04-10 Thread Paul Elschot
On Tuesday 10 April 2007 20:24, Yonik Seeley wrote:
> On 4/10/07, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> > In DisjunctionSumScorer, both skipTo() and next() invoke
> > initScorerDocQueue() on the first iteration.  However, since all
> > subscorers are added en masse via the constructor instead of
> > individually via an add() method which does not exist for this class,
> > it would be possible to trigger initScorerDocQueue() at construction
> > time rather than defer it, slightly simplifying the inner loop methods.
> 
> Yes, I think I made this change to one or two of the other scorers in the 
past.
> It makes more sense to me to pass everything needed in the constructor
> and get rid of the firstTime checks in next() and skipTo()

I kept this method of initializing because it was present in some
other existing Scorers. I did not really like it at the time either.

I thought it would avoid accessing the index as much as
possible before actually doing a search, but I did not
verify whether that is important.
In case it is not, any simplification is off course welcome.

Regards,
Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-04-10 Thread Paul Elschot
On Tuesday 10 April 2007 17:41, eks dev wrote:
> 
> If I remember well, the last time we profiled search with "high density" OR 
queries, scoring was taking up to 30% of the time. This was an 8-million-document 
collection of short documents fitting comfortably in RAM. So I am sure disabling 
scoring in some cases could bring us something. 
> 
> I am not familiar enough with the scoring internals to stand 100% behind 
this statement, so please take it with some healthy reserve.

For "high density OR" I'd guess most of the work was spent maintaining
the priority queue by document number. See also LUCENE-730 .

> 
> But anyhow, with Matcher in place, we have at least a chance to prove it 
brings something for this scenario. For the filtering case it definitely brings a 
lot. 
> 
> On another note: 
> Paul, would it be possible/easy to have something like the following? It looks 
easy to add, but I may be missing something: 
> BooleanQuery.add(Matcher mtr, BooleanClause.Occur occur)

That's one of the things I'd like to see added. It would allow a single
ConjunctionScorer to do a filtered search for a query with some
required terms.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-04-10 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487882
 ] 

Paul Elschot commented on LUCENE-584:
-

By fastest cache I meant the L1 cache of the processor. The size is normally in 
tens of kilobytes.
An array lookup hitting that cache takes about as much time as a floating point 
addition.

During a query search, the use of (among other things) the term frequencies, the 
proximity data, and the document weights normally causes L1 cache misses.

I would expect that by not doing the score value computations, only the cache 
misses for document weights can be saved.


> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, 
> Filter-20060628.patch, HitCollector-20060628.patch, 
> IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, 
> Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, 
> Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, 
> TestSortedVIntList.java
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Why ORScorer delayed init?

2007-04-10 Thread Yonik Seeley

On 4/10/07, Marvin Humphrey <[EMAIL PROTECTED]> wrote:

In DisjunctionSumScorer, both skipTo() and next() invoke
initScorerDocQueue() on the first iteration.  However, since all
subscorers are added en masse via the constructor instead of
individually via an add() method which does not exist for this class,
it would be possible to trigger initScorerDocQueue() at construction
time rather than defer it, slightly simplifying the inner loop methods.


Yes, I think I made this change to one or two of the other scorers in the past.
It makes more sense to me to pass everything needed in the constructor
and get rid of the firstTime checks in next() and skipTo().
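To make the before/after concrete, an illustrative sketch of the pattern (not the 
actual DisjunctionSumScorer code): because every sub-scorer is handed to the 
constructor, the scorer-doc queue can be built eagerly there, and next()/skipTo() 
lose their firstTime guards.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    class EagerInitExample {
      private final List scorerDocQueue = new ArrayList();

      EagerInitExample(List subScorers) throws IOException {
        initScorerDocQueue(subScorers);    // was: deferred behind a firstTime flag in next()/skipTo()
      }

      private void initScorerDocQueue(List subScorers) throws IOException {
        scorerDocQueue.addAll(subScorers); // stand-in for queueing each sub-scorer by its doc id
      }

      boolean next() throws IOException {
        // the inner loop starts immediately; no firstTime bookkeeping
        return !scorerDocQueue.isEmpty();
      }
    }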

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-794) SpanScorer and SimpleSpanFragmenter for Contrib Highlighter

2007-04-10 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487860
 ] 

Mark Miller commented on LUCENE-794:


Sorry Sean, I forgot to mention that the patch is off of the latest 
Lucene trunk code.

The range query test should fail because they switched the query parser 
to return a constant score query instead of a range query; a constant 
score query cannot be highlighted.

- Mark



> SpanScorer and SimpleSpanFragmenter for Contrib Highlighter
> ---
>
> Key: LUCENE-794
> URL: https://issues.apache.org/jira/browse/LUCENE-794
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Mark Miller
>Priority: Minor
> Attachments: CachedTokenStream.java, CachedTokenStream.java, 
> CachedTokenStream.java, DefaultEncoder.java, Encoder.java, Formatter.java, 
> Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java, 
> Highlighter.java, HighlighterTest.java, HighlighterTest.java, 
> HighlighterTest.java, HighlighterTest.java, MemoryIndex.java, 
> QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java, 
> QuerySpansExtractor.java, SimpleFormatter.java, spanhighlighter.patch, 
> spanhighlighter2.patch, spanhighlighter3.patch, spanhighlighter5.patch, 
> spanhighlighter_patch_4.zip, SpanHighlighterTest.java, 
> SpanHighlighterTest.java, SpanScorer.java, SpanScorer.java, 
> WeightedSpanTerm.java
>
>
> This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter 
> package that scores just like QueryScorer, but scores a 0 for Terms that did 
> not cause the Query hit. This gives 'actual' hit highlighting for the range 
> of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts 
> to fragment without breaking up Spans.
> See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-10 Thread Matt Ericson (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Ericson updated LUCENE-855:


Attachment: FieldCacheRangeFilter.patch

Fixed a bug with the BitSet's nextSetBit(i) and nextClearBit(i). I wrote a test 
to verify that they return the same values as a normal BitSet. I don't use these 
functions, so if someone wants to verify my fix, that would be great.

Added the ASF license header to the top of each file 
and fixed all of the bugs Otis reported.


> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, MemoryCachedRangeFilter.patch, 
> MemoryCachedRangeFilter_1.4.patch, TestRangeFilterPerformanceComparison.java, 
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  The performance increase is less 
> dramatic when a field has fewer unique terms or the index has fewer 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values are stored as longs, is that there's no 
> longer the need to make the values lexographically comparable, i.e. padding 
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The 
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step which can take a while to 
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
> can be called explicitly or is automatically called the first time 
> MemoryCachedRangeFilter is applied using a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - Field contains many unique numeric values
> - Index contains a large number of documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-10 Thread Matt Ericson (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Ericson updated LUCENE-855:


Attachment: TestRangeFilterPerformanceComparison.java

Andy, thank you for that test.

I took it and moved it to contrib/miscellaneous and added a few more tests, 
including the ChainedFilter test. Here is my version. I also fixed a few bugs 
in my code that I will be attaching next.

I also reformatted my results; I think they are a little easier to read. 
Here is what I get, and you're right: if you use a MatchAllDocsQuery, our two 
versions of the code perform about the same.

[junit] - Standard Output ---
[junit] Start interval: Thu Apr 11 10:55:02 PDT 2002
[junit] End interval: Tue Apr 10 10:55:02 PDT 2007
[junit] Creating RAMDirectory index...
[junit] Reader opened with 10 documents.  Creating RangeFilters...

[junit] TermQuery

[junit] FieldCacheRangeFilter
[junit]   * Total: 13ms
[junit]   * Bits: 0ms
[junit]   * Search: 9ms
[junit] MemoryCachedRangeFilter
[junit]   * Total: 209ms
[junit]   * Bits: 90ms
[junit]   * Search: 115ms
[junit] RangeFilter
[junit]   * Total: 12068ms
[junit]   * Bits: 6009ms
[junit]   * Search: 6051ms
[junit] Chained FieldCacheRangeFilter
[junit]   * Total: 15ms
[junit]   * Bits: 1ms
[junit]   * Search: 10ms
[junit] Chained MemoryCachedRangeFilter
[junit]   * Total: 177ms
[junit]   * Bits: 83ms
[junit]   * Search: 90ms

[junit] ConstantScoreQuery

[junit] FieldCacheRangeFilter
[junit]   * Total: 480ms
[junit]   * Bits: 1ms
[junit]   * Search: 474ms
[junit] MemoryCachedRangeFilter
[junit]   * Total: 757ms
[junit]   * Bits: 90ms
[junit]   * Search: 663ms
[junit] RangeFilter
[junit]   * Total: 18749ms
[junit]   * Bits: 6083ms
[junit]   * Search: 12655ms
[junit] Chained FieldCacheRangeFilter
[junit]   * Total: 11ms
[junit]   * Bits: 0ms
[junit]   * Search: 8ms
[junit] Chained MemoryCachedRangeFilter
[junit]   * Total: 776ms
[junit]   * Bits: 87ms
[junit]   * Search: 682ms

[junit] MatchAllDocsQuery

[junit] FieldCacheRangeFilter
[junit]   * Total: 1344ms
[junit]   * Bits: 5ms
[junit]   * Search: 1334ms
[junit] MemoryCachedRangeFilter
[junit]   * Total: 1468ms
[junit]   * Bits: 81ms
[junit]   * Search: 1381ms
[junit] RangeFilter
[junit]   * Total: 13360ms
[junit]   * Bits: 6091ms
[junit]   * Search: 7254ms
[junit] Chained FieldCacheRangeFilter
[junit]   * Total: 924ms
[junit]   * Bits: 4ms
[junit]   * Search: 916ms
[junit] Chained MemoryCachedRangeFilter
[junit]   * Total: 1507ms
[junit]   * Bits: 84ms
[junit]   * Search: 1415ms
[junit] -  ---


> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, 
> TestRangeFilterPerformanceComparison.java, 
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  The performance increase is less 
> dramatic when a field has fewer unique terms or when there are fewer 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can easily be changed to support Strings.  A 
> side "benefit" of storing the values as longs is that there is no longer a 
> need to make the values lexicographically comparable, i.e. to pad numeric 
> values with zeros.
> The downside of using MemoryCache
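
A minimal sketch of the caching approach the description above outlines: field 
values and their doc ids cached in parallel arrays sorted by value, with a 
binary search locating the start of the range during bits(). Class, field and 
method names here are illustrative only, not the actual patch code, and the 
sketch uses a single binary search plus a forward walk rather than two binary 
searches.

{code}
import java.util.BitSet;

// Illustrative sketch only -- not the MemoryCachedRangeFilter patch itself.
class SortedFieldCacheSketch {
  private final long[] values;  // field values, sorted ascending
  private final int[] docIds;   // docIds[i] belongs to values[i]
  private final int maxDoc;

  SortedFieldCacheSketch(long[] values, int[] docIds, int maxDoc) {
    this.values = values;
    this.docIds = docIds;
    this.maxDoc = maxDoc;
  }

  /** Set a bit for every document whose value lies in [lower, upper]. */
  BitSet bits(long lower, long upper) {
    BitSet bits = new BitSet(maxDoc);
    int i = firstAtLeast(lower);                 // start of the range
    for (; i < values.length && values[i] <= upper; i++) {
      bits.set(docIds[i]);                       // docs inside the range
    }
    return bits;
  }

  // Binary search for the first index whose value is >= target.
  private int firstAtLeast(long target) {
    int lo = 0, hi = values.length;              // search window [lo, hi)
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] < target) lo = mid + 1; else hi = mid;
    }
    return lo;
  }
}
{code}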

[jira] Commented: (LUCENE-794) SpanScorer and SimpleSpanFragmenter for Contrib Highlighter

2007-04-10 Thread Sean O'Connor (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487847
 ] 

Sean O'Connor commented on LUCENE-794:
--

I was able to apply the spanhighlighter5.patch. I'm inexperienced with ant and 
svn, so I assume the slight troubles I had were self-inflicted; I mention them 
in case they are of any help.

I might have missed something, but my MemoryIndex.java seemed to be missing the 
implementation of the abstract isPayloadAvailable() method from TermPositions. 
That was causing my build to fail, so I added the method, simply returning 
false.
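
For illustration, the stub described above amounts to the following override 
on the TermPositions implementation inside MemoryIndex (shown out of context; 
the surrounding class is not reproduced here):

{code}
// Workaround described above: report payloads as unavailable so the
// abstract TermPositions method is implemented and the build compiles.
public boolean isPayloadAvailable() {
  return false;
}
{code}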

After that change, the tests ran, and life was good again. I do get a failed 
test at 
org.apache.lucene.search.highlight.HighlighterTest.testGetRangeFragments(HighlighterTest.java:137),
 but it looks like that might be expected. The search is "[kannedy TO kznnedy]".

I am now looking into getting the total number of hits for a given query (for 
un-normalized scoring), and the hit positions (saved for larger scale analysis 
and browsing). I have code that does this, but hope I can improve on my 
existing approach by using this highlighting patch.
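
As a rough illustration of the total-hit counting mentioned here, independent 
of the highlighter: a HitCollector can count (and, if needed, record) every 
matching document. The variable names are only examples; "searcher" and 
"query" are assumed to exist already.

{code}
// Assumes: import org.apache.lucene.search.HitCollector;
// "searcher" is an open IndexSearcher, "query" the parsed Query.
final int[] totalHits = new int[1];
searcher.search(query, new HitCollector() {
  public void collect(int doc, float score) {
    totalHits[0]++;          // doc ids could also be saved here for browsing
  }
});
System.out.println("total hits: " + totalHits[0]);
{code}
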
Thanks,

Sean


> SpanScorer and SimpleSpanFragmenter for Contrib Highlighter
> ---
>
> Key: LUCENE-794
> URL: https://issues.apache.org/jira/browse/LUCENE-794
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Mark Miller
>Priority: Minor
> Attachments: CachedTokenStream.java, CachedTokenStream.java, 
> CachedTokenStream.java, DefaultEncoder.java, Encoder.java, Formatter.java, 
> Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java, 
> Highlighter.java, HighlighterTest.java, HighlighterTest.java, 
> HighlighterTest.java, HighlighterTest.java, MemoryIndex.java, 
> QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java, 
> QuerySpansExtractor.java, SimpleFormatter.java, spanhighlighter.patch, 
> spanhighlighter2.patch, spanhighlighter3.patch, spanhighlighter5.patch, 
> spanhighlighter_patch_4.zip, SpanHighlighterTest.java, 
> SpanHighlighterTest.java, SpanScorer.java, SpanScorer.java, 
> WeightedSpanTerm.java
>
>
> This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter 
> package that scores just like QueryScorer, but scores a 0 for Terms that did 
> not cause the Query hit. This gives 'actual' hit highlighting for the range 
> of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts 
> to fragment without breaking up Spans.
> See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Why ORScorer delayed init?

2007-04-10 Thread Marvin Humphrey

Greets,

In DisjunctionSumScorer, both skipTo() and next() invoke  
initScorerDocQueue() on the first iteration.  However, since all  
subscorers are added en masse via the constructor (there is no per-scorer  
add() method for this class), it would be possible to trigger  
initScorerDocQueue() at construction time rather than defer it, slightly  
simplifying the inner-loop methods.


Does the delay offer some advantage that I'm missing?  It looks like  
an artifact.
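
For reference, a self-contained illustration of the two initialization styles 
being compared; this is not the actual DisjunctionSumScorer code, and the 
types are placeholders:

{code}
import java.util.List;

// Not the real Lucene class -- just the lazy-vs-eager pattern in question.
class LazyVsEagerInit {
  private final List subScorers;
  private Object docQueue;               // stands in for the scorer doc queue

  // Current style: the queue is built lazily on the first next()/skipTo().
  LazyVsEagerInit(List subScorers) {
    this.subScorers = subScorers;
  }

  boolean next() {
    if (docQueue == null) {
      docQueue = buildQueue();           // deferred to the first iteration
    }
    return advance();
  }

  // Proposed style: since all subscorers arrive via the constructor, the
  // queue could be built there, and next()/skipTo() drop the null check:
  //
  //   LazyVsEagerInit(List subScorers) {
  //     this.subScorers = subScorers;
  //     this.docQueue = buildQueue();
  //   }

  private Object buildQueue() { return new Object(); }
  private boolean advance()   { return false; }
}
{code}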


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Hudson build is back to normal: Lucene-Nightly #53

2007-04-10 Thread hudson
See http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/53/



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-04-10 Thread eks dev

If I remember correctly, the last time we profiled search with "high density" OR 
queries, scoring was taking up to 30% of the time. This was an 8-million-document 
collection of short documents fitting comfortably in RAM. So I am sure disabling 
scoring could bring us something in some cases.

I am not familiar enough with the scoring internals to stand 100% behind this 
statement, so please take it with some healthy reserve.

But anyhow, with Matcher in place, we at least have a chance to prove that it 
brings something for this scenario. For the filtering case it definitely brings 
a lot.

On another note: Paul, would it be possible/easy to have something like the 
following? It looks easy to add, but I may be missing something: 
BooleanQuery.add(Matcher mtr,
BooleanClause.Occur occur)
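
A purely hypothetical sketch of the overload being asked about; neither this 
method nor a released Matcher class exists in the BooleanQuery API, and 
"Matcher" refers to the class attached to this issue:

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

class MatcherClauseSketch {
  BooleanQuery build(/* Matcher aclMatcher */) {
    BooleanQuery bq = new BooleanQuery();
    bq.add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.MUST);
    // Proposed addition -- restrict results without scoring them:
    // bq.add(aclMatcher, BooleanClause.Occur.MUST);
    return bq;
  }
}
{code}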



- Original Message 
From: Otis Gospodnetic (JIRA) <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Tuesday, 10 April, 2007 5:11:32 PM
Subject: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet


[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487789
 ] 

Otis Gospodnetic commented on LUCENE-584:
-

Ah, too bad. :(
Last time I benchmarked Lucene searching on Sun's Niagara vs. non-massive Intel 
boxes, Intel boxes with Linux on them actually won, and my impression was that 
this was due to Niagara's weak FPU (a known weakness in Niagara, I believe).  
Thus, I thought, if we could just skip scoring and various floating point 
calculations, we'd see better performance, esp. on Niagara boxes.

Paul, when you say "fastest cache", what exactly are you referring to?  The 
Niagara I tested things on had 32GB of RAM, and I gave the JVM 20+GB, so at 
least the JVM had plenty of RAM to work with.









-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-04-10 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487789
 ] 

Otis Gospodnetic commented on LUCENE-584:
-

Ah, too bad. :(
Last time I benchmarked Lucene searching on Sun's Niagara vs. non-massive Intel 
boxes, Intel boxes with Linux on them actually won, and my impression was that 
this was due to Niagara's weak FPU (a known weakness in Niagara, I believe).  
Thus, I thought, if we could just skip scoring and various floating point 
calculations, we'd see better performance, esp. on Niagara boxes.

Paul, when you say "fastest cache", what exactly are you referring to?  The 
Niagara I tested things on had 32GB of RAM, and I gave the JVM 20+GB, so at 
least the JVM had plenty of RAM to work with.


> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, 
> Filter-20060628.patch, HitCollector-20060628.patch, 
> IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, 
> Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, 
> Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, 
> TestSortedVIntList.java
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Maven artifacts for Lucene.*

2007-04-10 Thread Sami Siren

I have been hoping to put up a mechanism for (easier) deployment of m2
artifacts to maven repositories (both Apache snapshot repository and the
main maven repository at ibiblio).

The most convenient way would be to use maven2 to build the various lucene
projects, but as the mailing list conversation about this subject indicates,
there is no common interest in changing the (working) ant-based build system
to a maven-based one.

The next best thing, IMO, would be to use the ant build as normal for the
non-maven2 releases and maven2 for building the maven releases (.jar files,
and optionally also packages for the sources used to build the binaries and
packages for the javadocs), with the related checksums and signatures.

To repeat it one more time: what I am proposing here is not meant to replace
the current solid way of building the various Lucene projects - I am just
trying to provide a convenient way to produce the release artifacts to be
deployed to maven repositories.

I have put together an initial set of poms (for lucene-java) to do this quite
easily; basically, all that is required is an installation of the maven2
binaries, the set of pom files, and a checkout of the lucene version to build.

The various jars are built, packaged, checksummed, signed and optionally
deployed with a single mvn command. So IMO it is quite an easy thing to do in
addition to the normal release process.

I can also volunteer, for an undefined time, to do these builds if it is too
much of a burden for the RMs.

There are, however, a couple of things I need your opinion on (or at least
your attention to):

1. There are differences compared to the ant-built jars: due to the release
policy of apache.org, the built jars will contain LICENSE.txt and NOTICE.txt
in /META-INF. Is this a problem?

2. I propose that we add an additional folder level, so that the groupId for
lucene-java would be org.apache.lucene.java (it is now org.apache.lucene in
the currently released artifacts). The initial list of artifacts (the newly
proposed structure) is listed below:

groupId:org.apache.lucene
lucene-parent (pom) (a top level pom defining lucene wide stuff that
gets inherited to sub project modules)

groupId:org.apache.lucene.java
java-parent (pom)
lucene-core (jar)
lucene-demos (jar)
contrib-parent (pom)
lucene-analyzers (jar)
lucene-benchmark (jar)
lucene-highlighter(jar)
lucene-misc (jar)
lucene-queries (jar)
lucene-regex (jar)
lucene-snowball (jar)
lucene-spellchecker(jar)
lucene-surround (jar)
lucene-swing (jar)
lucene-wordnet (jar)
lucene-xml-query-parser (jar)

groupId:org.apache.lucene.nutch (TODO)
nutch-parent (pom)
nutch-core (jar)
nutch-plugins (pom)
nutch-plugin-x (jar) (as soon as nutch plugins can be of format
.jar)
...

groupId:org.apache.lucene.hadoop (TODO)
hadoop-parent (pom)
hadoop-core (jar)
hadoop-streaming (jar)
...

groupId:org.apache.lucene.solr (TODO)
solr-parent (pom)
solr-core (jar)
...

3. Where to put the poms? They need to go somewhere. I think it's not smart
at this point to pollute the ant-driven folder structure with poms - they are
better off in a separate dir structure. What is (in your opinion) the most
convenient place for them?

I would propose that every sub-project have a dir named maven (or something
similar) that contains the poms for that particular sub-project.

Another possibility would be a lucene-level dir for the maven stuff, where
the poms would be maintained.



The text above was my initial thought on this; however, there have been
concerns that the procedure described here might not be the optimal one. So
far the arguments have been the following:

1. Two build systems to maintain

True. However, I don't see it quite so black and white: you would need to
maintain the poms manually anyway (if you care about the quality of the poms),
or you would have to build some mechanism to generate them. Of course, in a
situation where you do not actually build with maven, the poms could be a bit
simpler.

2. Two build systems producing different jars - would maven2 releases require
a separate vote?

Yes, the artifacts (jars) would be different, because you would need to add
LICENSE and MANIFEST files into them (because of apache policy). I don't know
about the vote; how do other projects deal with this kind of situation - can
anyone here tell?

One solution to the jar mismatch would be changing the ant build to put those
files into the produced jars.

3. Additional burden for the RM: need to run an additional command, install maven

There will be that external step for doing the maven release, and you need to
install maven as well. But compare that to the current situation, where you
would have to extract the jars, put some more files into them, sign them,
modify the poms to reflect the correct version numbers, and upload them to the
repositories manually.

The other way to do it would be to change the current build system to be more
maven-friendly. This would probably mean the following things:

- add poms for the artifacts to the svn repository (where?)
- add LICENSE and NOTICE into the jars
- add an ant target to
  - sign the jars
  - push the artifacts into staging

[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-04-10 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487706
 ] 

Paul Elschot commented on LUCENE-584:
-

That could be improved in a DisjunctionMatcher.
With a bit of bookkeeping, DisjunctionSumScorer could also delay calling 
score() on the subscorers, but the bookkeeping would affect performance for 
the normal case.

For the usual queries the score() call will never have much of a performance 
impact. The reason for this is that TermScorer.score() is really very 
efficient; iirc it caches weighted tf() values for low term frequencies. All 
the rest is mostly additions, and occasionally a multiplication for a 
coordination factor.

To determine which documents match the query, the index needs to be accessed, 
and that takes more time than the score value computations, because the 
complete index almost never fits in the fastest cache.
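
A toy illustration of the kind of caching described above: weighted tf() 
values precomputed for small term frequencies, so the hot loop only does an 
array lookup. The cache size and the sqrt-based tf formula are assumptions for 
the sketch, not the actual TermScorer internals.

{code}
// Toy sketch of caching weighted tf() scores for low term frequencies.
class ScoreCacheSketch {
  private static final int SCORE_CACHE_SIZE = 32;   // assumed size
  private final float[] scoreCache = new float[SCORE_CACHE_SIZE];
  private final float weightValue;

  ScoreCacheSketch(float weightValue) {
    this.weightValue = weightValue;
    for (int freq = 0; freq < SCORE_CACHE_SIZE; freq++) {
      // precompute tf(freq) * weight once, outside the scoring loop
      scoreCache[freq] = (float) Math.sqrt(freq) * weightValue;
    }
  }

  float score(int freq) {
    return freq < SCORE_CACHE_SIZE
        ? scoreCache[freq]                           // common case: lookup
        : (float) Math.sqrt(freq) * weightValue;     // rare case: compute
  }
}
{code}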



> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, 
> Filter-20060628.patch, HitCollector-20060628.patch, 
> IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, 
> Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, 
> Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, 
> TestSortedVIntList.java
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]