[jira] Updated: (LUCENE-857) Remove BitSet caching from QueryFilter

2007-04-09 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-857:


Attachment: LUCENE-857.refactoring-approach.diff

An example of what I'm thinking would make sense from a backwards compatibility 
standpoint ... javadocs could still use some improvement.

> Remove BitSet caching from QueryFilter
> --
>
> Key: LUCENE-857
> URL: https://issues.apache.org/jira/browse/LUCENE-857
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Otis Gospodnetic
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: LUCENE-857.patch, LUCENE-857.refactoring-approach.diff
>
>
> Since caching is built into the public BitSet bits(IndexReader reader)  
> method, I don't see a way to deprecate that, which means I'll just cut it out 
> and document it in CHANGES.txt.  Anyone who wants QueryFilter caching will be 
> able to get the caching back by wrapping the QueryFilter in the 
> CachingWrapperFilter.
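
As an illustrative aside (a hedged sketch, not from the attached patch; the 
searcher and query names are assumed), getting the old cached behavior back 
after the change should be a one-line wrap using the existing 
CachingWrapperFilter:

    Query query = new TermQuery(new Term("category", "books"));
    Filter cached = new CachingWrapperFilter(new QueryFilter(query));
    Hits hits = searcher.search(new MatchAllDocsQuery(), cached);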

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-857) Remove BitSet caching from QueryFilter

2007-04-09 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487679
 ] 

Hoss Man commented on LUCENE-857:
-

I don't think it's a question of being careless about reading the Changelog -- 
I just think that when dealing with a point release, we shouldn't require 
people to make code changes just to get the same behavior as before ... if this 
was necessary to fix a bug it would be one thing, but really what we're talking 
about here is refactoring out a piece of functionality (using a Query as a 
Filter) so that it can be used independently from another piece of 
functionality (filter caching) ... since that can be done in a backwards-compatible way, why not make it easy for people?

> With your suggestion one can't get a raw QueryFilter without getting it 
> automatically cached. Isn't this inflexibility uncool? 

...not quite, I'm suggesting that the "raw" QueryFilter behavior be extracted 
into a new class (QueryWrapperFilter) and the existing QueryFilter class 
continue to do exactly what it currently does - but refactored so that there is 
no duplicate code.
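
A hedged sketch of what that refactoring could look like (QueryWrapperFilter 
is the name from the comment above; the attached diff may differ in detail; 
imports from org.apache.lucene.search/index and java.util are assumed):

    // Raw query-as-filter behavior, no caching:
    public class QueryWrapperFilter extends Filter {
      private final Query query;
      public QueryWrapperFilter(Query query) { this.query = query; }
      public BitSet bits(IndexReader reader) throws IOException {
        final BitSet bits = new BitSet(reader.maxDoc());
        new IndexSearcher(reader).search(query, new HitCollector() {
          public void collect(int doc, float score) { bits.set(doc); }
        });
        return bits;
      }
    }

    // QueryFilter keeps its current (cached) behavior via composition:
    public class QueryFilter extends CachingWrapperFilter {
      public QueryFilter(Query query) { super(new QueryWrapperFilter(query)); }
    }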

> Remove BitSet caching from QueryFilter
> --
>
> Key: LUCENE-857
> URL: https://issues.apache.org/jira/browse/LUCENE-857
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Otis Gospodnetic
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: LUCENE-857.patch
>
>
> Since caching is built into the public BitSet bits(IndexReader reader)  
> method, I don't see a way to deprecate that, which means I'll just cut it out 
> and document it in CHANGES.txt.  Anyone who wants QueryFilter caching will be 
> able to get the caching back by wrapping the QueryFilter in the 
> CachingWrapperFilter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-04-09 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487675
 ] 

Marvin Humphrey commented on LUCENE-584:


DisjunctionSumScorer (the ORScorer) actually calls Scorer.score() on all of the 
matching scorers in the ScorerDocQueue during next(), in order to accumulate an 
aggregate score.  The MatchCollector can't save you from that.

> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, 
> Filter-20060628.patch, HitCollector-20060628.patch, 
> IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, 
> Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, 
> Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, 
> TestSortedVIntList.java
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-04-09 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487674
 ] 

Otis Gospodnetic commented on LUCENE-584:
-

Ah.  I'll look at the patch again tomorrow and follow what you said.  All 
this time I was under the impression that one of the points or at least 
side-effects of the Matcher was that scoring was skipped, which would be 
perfect where matches are ordered by anything other than relevance.



> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, 
> Filter-20060628.patch, HitCollector-20060628.patch, 
> IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, 
> Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, 
> Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, 
> TestSortedVIntList.java
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-04-09 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487667
 ] 

Doron Cohen commented on LUCENE-584:


> No Scorer, no BooleanScorer(2), no ConjunctionScorer... 

Thanks, I was reading "score" instead of "score()"...

But there is a scorer in the process; it is used for next()-ing to matched 
docs. So most of the work - preparing to be able to compute the scores - has 
already been done. The scorer doc queue is created and populated. Not calling 
score() saves the (final) loop over the scorers that aggregates their scores, 
multiplies by the coord factor, etc. I assume this is why only a small 
speed-up is seen. 


> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, 
> Filter-20060628.patch, HitCollector-20060628.patch, 
> IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, 
> Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, 
> Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, 
> TestSortedVIntList.java
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-794) SpanScorer and SimpleSpanFragmenter for Contrib Highlighter

2007-04-09 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-794:
---

Attachment: spanhighlighter5.patch

Apologies for the delay on this -- I was pulled into a busy product launch.

This adds the final piece, replacing TermModifer with multiple Memory Indexes.

I also did a little refactoring, especially in the SpansExtractor.

All tests now pass and I have been using this successfully for some time now.

For anyone newly following this issue, ignore all of the files except for this 
one: spanhighlighter5.patch

- Mark

> SpanScorer and SimpleSpanFragmenter for Contrib Highlighter
> ---
>
> Key: LUCENE-794
> URL: https://issues.apache.org/jira/browse/LUCENE-794
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Mark Miller
>Priority: Minor
> Attachments: CachedTokenStream.java, CachedTokenStream.java, 
> CachedTokenStream.java, DefaultEncoder.java, Encoder.java, Formatter.java, 
> Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java, 
> Highlighter.java, HighlighterTest.java, HighlighterTest.java, 
> HighlighterTest.java, HighlighterTest.java, MemoryIndex.java, 
> QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java, 
> QuerySpansExtractor.java, SimpleFormatter.java, spanhighlighter.patch, 
> spanhighlighter2.patch, spanhighlighter3.patch, spanhighlighter5.patch, 
> spanhighlighter_patch_4.zip, SpanHighlighterTest.java, 
> SpanHighlighterTest.java, SpanScorer.java, SpanScorer.java, 
> WeightedSpanTerm.java
>
>
> This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter 
> package that scores just like QueryScorer, but scores a 0 for Terms that did 
> not cause the Query hit. This gives 'actual' hit highlighting for the range 
> of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts 
> to fragment without breaking up Spans.
> See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Large scale sorting

2007-04-09 Thread Paul Smith


A memory saving optimization would be to not load the corresponding
String[] in the string index (as discussed previously), but there is
currently no way to tell the FieldCache that the strings are unneeded.
The String values are only needed for merging results in a
MultiSearcher.


Yep, which happens all the time for us specifically, because we have  
an 'archive' index and a 'week' index. The week index is merged once per  
week, so the search is always a merged sort across the two. (The week  
index is reloaded every 5 seconds or so; the archive index is kept in  
memory once loaded.)






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Large scale sorting

2007-04-09 Thread Yonik Seeley

On 4/9/07, jian chen <[EMAIL PROTECTED]> wrote:

But, on a higher level, my idea is really just to create an array of
integers for each sort field. The array length is NumOfDocs in the index.
Each integer corresponds to a displayable string value. For example, if you
have a field of different colors, you can assign integers like this:

0 <=> white
1 <=> blue
2 <=> yellow
...

Thus, you don't need to use strings for sorting.


This is how it is currently done.  Sorting using an IndexSearcher does
not do string comparisons at all; it just compares ordinals
retrieved from an int[].

A memory saving optimization would be to not load the corresponding
String[] in the string index (as discussed previously), but there is
currently no way to tell the FieldCache that the strings are unneeded.
The String values are only needed for merging results in a
MultiSearcher.
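
For reference, a small sketch of what the current API hands back (real 
FieldCache fields; reader and docA are assumed to be in scope):

    FieldCache.StringIndex si =
        FieldCache.DEFAULT.getStringIndex(reader, "color");
    int ordA = si.order[docA];       // per-document ordinal used to compare
    String value = si.lookup[ordA];  // the String[] that is only needed when
                                     // merging results in a MultiSearcher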

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Large scale sorting

2007-04-09 Thread jian chen

Hi, Paul,

I think whether to warm up or not needs some benchmarking for the specific
application.

For the implementation of the sort fields, when I talk about norms in
Lucene, I am thinking we could borrow the same implementation as the norms to
do it.

But, on a higher level, my idea is really just to create an array of
integers for each sort field. The array length is NumOfDocs in the index.
Each integer corresponds to a displayable string value. For example, if you
have a field of different colors, you can assign integers like this:

0 <=> white
1 <=> blue
2 <=> yellow
...

Thus, you don't need to use strings for sorting. For example, if you have
documents 0, 1, and 2, which store the colors blue, white, and yellow
respectively, the array would be:

{1, 0, 2}.
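
A tiny sketch of that mapping in plain Java (the names and values are just
the color example above):

    String[] docColors = {"blue", "white", "yellow"};   // docs 0, 1, 2
    Map<String, Integer> ord = new HashMap<String, Integer>();
    ord.put("white", 0); ord.put("blue", 1); ord.put("yellow", 2);

    int[] sortValues = new int[docColors.length];
    for (int doc = 0; doc < docColors.length; doc++) {
      sortValues[doc] = ord.get(docColors[doc]);        // yields {1, 0, 2}
    }
    // Comparing two hits is now an int comparison, no String compares:
    // docA sorts before docB iff sortValues[docA] < sortValues[docB]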

To do sorting, this array could be pre-loaded into memory (warming up the
index), or, while collecting the hits (in a HitCollector), the relevant
integer values could be loaded from disk given a doc id.

If you have 10 million documents, one sort field needs a 10M x 4 bytes = 40
MB array.

Cheers,

Jian


On 4/9/07, Paul Smith <[EMAIL PROTECTED]> wrote:


>
> In our application, we have to sync up the index pretty frequently,
> the
> warm-up of the index is killing it.
>

Yep, it speeds up the first sort, but at the cost of making all the
others slower (maybe significantly so).  That's obviously not ideal
but could make use of sorts in larger indexes practical.

> To address your concern about single sort locale, what about
> creating a sort
> field for each sort locale? So, if you have, say, 10 locales, you
> will have
> 10 sort fields, each utilizing the mechanism of constructing the
> norms.
>

I really don't understand norms properly so I'm not sure exactly how
that would help.  I'll have to go over your original email again to
understand.

My main goal is to get some discussion going amongst the community,
which hopefully we've kicked along.


Paul

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Large scale sorting

2007-04-09 Thread Paul Smith


In our application, we have to sync up the index pretty frequently,  
the

warm-up of the index is killing it.



Yep, it speeds up the first sort, but at the cost of making all the  
others slower (maybe significantly so).  That's obviously not ideal  
but could make use of sorts in larger indexes practical.


To address your concern about single sort locale, what about  
creating a sort
field for each sort locale? So, if you have, say, 10 locales, you  
will have
10 sort fields, each utilizing the mechanism of constructing the  
norms.




I really don't understand norms properly so I'm not sure exactly how  
that would help.  I'll have to go over your original email again to  
understand.


My main goal is to get some discussion going amongst the community,  
which hopefully we've kicked along.



Paul

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Large scale sorting

2007-04-09 Thread jian chen

Hi, Paul,

Thanks for your reply. Regarding your previous email about the need for a
disk-based sorting solution, I largely agree with your points. One incentive
for your approach is that we no longer need to warm up the index when the
index is huge.

In our application, we have to sync up the index pretty frequently, the
warm-up of the index is killing it.

To address your concern about single sort locale, what about creating a sort
field for each sort locale? So, if you have, say, 10 locales, you will have
10 sort fields, each utilizing the mechanism of constructing the norms.

At query time, in the HitCollector, for each doc id matched, you can load
the field value (an integer) through the IndexReader. (Here you need to
enhance the IndexReader to be able to load the sort field values.) Then, you
can use that value to reject/accept the doc, or factor it into the score.

What do you think?

Jian



On 4/9/07, Paul Smith <[EMAIL PROTECTED]> wrote:


>
> Now, if we could use integers to represent the sort field values,
> which is
> typically the case for most applications, maybe we can afford to
> have the
> sort field values stored in the disk and do disk lookup for each
> document
> matched? The look up of the sort field value will be as simple as
> docNo * 4
> + offset.
>
> This way, we use the same approach as constructing the norms
> (proper merging
> for incremental indexing), but, at search time, we don't load the
> sort field
> values into memory, instead, just store them in disk.
>
> Will this approach be good enough?

While a nifty idea, I think this only works for a single sort
locale.  I initially came up with a similar idea that the terms are
already stored in 'sorted' order and one might be able to use the
term's position for sorting; it's just that the term ordering
position is different in different locales.

Paul

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




[jira] Commented: (LUCENE-859) Expose the number of deleted docs in index/segment

2007-04-09 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487644
 ] 

Yonik Seeley commented on LUCENE-859:
-

> Though it might still be handy to have something with main() that spits out 
> the number of deleted
> documents, as SegmentReader has in my patch.

I don't understand that comment.  I don't see anything in your patch besides 
the implementation of deletedDocs().

> Maybe that should be added to the existing IndexReader.main ?

That sounds fine.

> Expose the number of deleted docs in index/segment
> --
>
> Key: LUCENE-859
> URL: https://issues.apache.org/jira/browse/LUCENE-859
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Otis Gospodnetic
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: LUCENE-859
>
>
> Use case:
> We've got a lot of large, mostly search-only indices. These indices are not 
> re-optimized once "deployed".   Docs in them do not get updated, but they do 
> get deleted.  After a while, the number of deleted docs grows, but it's hard 
> to tell how many documents have been deleted.
> Exposing the number of deleted docs via a *Reader.deletedDocs() method lets 
> you get to this number.
> I'm attaching a patch that touches the following:
> M  src/test/org/apache/lucene/index/TestSegmentReader.java
> M  src/java/org/apache/lucene/index/MultiReader.java
> M  src/java/org/apache/lucene/index/IndexReader.java
> M  src/java/org/apache/lucene/index/FilterIndexReader.java
> M  src/java/org/apache/lucene/index/ParallelReader.java
> M  src/java/org/apache/lucene/index/SegmentReader.java
> SegmentReader also got a public static main(String[]) that takes 1 
> command-line parameter, a path to the index to check, and prints out the 
> number of deleted docs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Large scale sorting

2007-04-09 Thread Doug Cutting

Paul Smith wrote:
I don't disagree with the premise that it involves substantial I/O and 
would increase the time taken to sort, and why this approach shouldn't 
be the default mechanism, but it's not too difficult to build a disk I/O 
subsystem that can allocate many spindles to service this and to allow 
the underlying OS to use its buffer cache (yes this is sounding like a 
database server now isn't it).


My guess is that it'd be cheaper to just buy more RAM.

It would be better if the sorting 
mechanism in Lucene were a little more decoupled such that more 
customised designs could be utilised for specific scenarios.  Right 
now it's a one-size-fits-all approach that can't be customised without 
substantial gutting of the code.


That's just what most folks have found useful to date.  If you have a 
patch to decouple it, and others find it useful, then it should be 
seriously considered.  I do have some concerns about whether the 
approach you suggest is in fact useful, but am happy to be proven wrong.


Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Large scale sorting

2007-04-09 Thread Paul Smith


Now, if we could use integers to represent the sort field values,  
which is
typically the case for most applications, maybe we can afford to  
have the
sort field values stored in the disk and do disk lookup for each  
document
matched? The look up of the sort field value will be as simple as  
docNo * 4

+ offset.

This way, we use the same approach as constructing the norms  
(proper merging
for incremental indexing), but, at search time, we don't load the  
sort field

values into memory, instead, just store them in disk.

Will this approach be good enough?


While a nifty idea, I think this only works for a single sort  
locale.  I initially came up with a similar idea that the terms are  
already stored in 'sorted' order and one might be able to use the  
term's position for sorting; it's just that the term ordering  
position is different in different locales.


Paul

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Large scale sorting

2007-04-09 Thread Paul Smith


On 10/04/2007, at 4:18 AM, Doug Cutting wrote:


Paul Smith wrote:

Disadvantages to this approach:
* It's a lot more I/O intensive


I think this would be prohibitive.  Queries matching more than a  
few hundred documents will take several seconds to sort, since  
random disk accesses are required per matching document.  Such an  
approach is only practical if you can guarantee that queries match  
fewer than a hundred documents, which is not generally the case,  
especially with large collections.




I don't disagree with the premise that it involves substantial I/O  
and would increase the time taken to sort, and why this approach  
shouldn't be the default mechanism, but it's not too difficult to  
build a disk I/O subsystem that can allocate many spindles to service  
this and to allow the underlying OS to use its buffer cache (yes  
this is sounding like a database server now isn't it).


I'm working on the basis that it's a LOT harder/more expensive to  
simply allocate more heap size to cover the current sorting  
infrastructure.   One hits memory limits faster.  Not everyone can  
afford 64-bit hardware with many Gb RAM to allocate to a heap.  It  
_is_ cheaper/easier to build a disk subsystem to tune this I/O  
approach, and one can still use any RAM as buffer cache for the  
memory-mapped file anyway.


In my experience, raw search time starts to climb towards one  
second per query as collections grow to around 10M documents (in  
round figures and with lots of assumptions).  Thus, searching on a  
single CPU is less practical as collections grow substantially  
larger than 10M documents, and distributed solutions are required.   
So it would be convenient if sorting is also practical for ~10M  
document collections on standard hardware.  If 10M strings with 20  
characters are required in memory for efficient search, this  
requires 400MB.  This is a lot, but not an unusual amount on today's  
machines.  However, if you have a large number of fields, then this  
approach may be problematic and force you to consider a distributed  
solution earlier than you might otherwise.


400MB is not a lot in and of itself, but when one has many of these types  
of indexes, with many sort fields and many locales on the same  
host, it becomes problematic.  I'm sure there's a point where  
distributing doesn't work over really large collections, because even  
if one partitioned an index across many hosts, one still needs to  
merge-sort the results together.


It would be disappointing if Lucene's innate design limited itself to  
10M document collections before needing to consider distributed  
solutions.  10M is not that many.   It would be better if the sorting  
mechanism in Lucene were a little more decoupled such that more  
customised designs could be utilised for specific scenarios.  Right  
now it's a one-size-fits-all approach that can't be customised without  
substantial gutting of the code.


cheers,

Paul


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-859) Expose the number of deleted docs in index/segment

2007-04-09 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487640
 ] 

Otis Gospodnetic commented on LUCENE-859:
-

Though it might still be handy to have something with main() that spits out the 
number of deleted documents, as SegmentReader has in my patch.

What do you think about committing just that?  Maybe that should be added to 
the existing IndexReader.main?

Or maybe it's time to start an app/class in contrib/index that takes various 
command line parameters and prints out information about the index?  If so, 
I'll move that to a new JIRA issue.


> Expose the number of deleted docs in index/segment
> --
>
> Key: LUCENE-859
> URL: https://issues.apache.org/jira/browse/LUCENE-859
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Otis Gospodnetic
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: LUCENE-859
>
>
> Use case:
> We've got a lot of large, mostly search-only indices. These indices are not 
> re-optimized once "deployed".   Docs in them do not get updated, but they do 
> get deleted.  After a while, the number of deleted docs grows, but it's hard 
> to tell how many documents have been deleted.
> Exposing the number of deleted docs via a *Reader.deletedDocs() method lets 
> you get to this number.
> I'm attaching a patch that touches the following:
> M  src/test/org/apache/lucene/index/TestSegmentReader.java
> M  src/java/org/apache/lucene/index/MultiReader.java
> M  src/java/org/apache/lucene/index/IndexReader.java
> M  src/java/org/apache/lucene/index/FilterIndexReader.java
> M  src/java/org/apache/lucene/index/ParallelReader.java
> M  src/java/org/apache/lucene/index/SegmentReader.java
> SegmentReader also got a public static main(String[]) that takes 1 
> command-line parameter, a path to the index to check, and prints out the 
> number of deleted docs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Closed: (LUCENE-859) Expose the number of deleted docs in index/segment

2007-04-09 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic closed LUCENE-859.
---

   Resolution: Won't Fix
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Doh, of course! numDocs() looks like this:

  public int numDocs() {
    int n = maxDoc();
    if (deletedDocs != null)
      n -= deletedDocs.count();
    return n;
  }

Won't Fix.

> Expose the number of deleted docs in index/segment
> --
>
> Key: LUCENE-859
> URL: https://issues.apache.org/jira/browse/LUCENE-859
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Otis Gospodnetic
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: LUCENE-859
>
>
> Use case:
> We've got a lot of large, mostly search-only indices. These indices are not 
> re-optimized once "deployed".   Docs in them do not get updated, but they do 
> get deleted.  After a while, the number of deleted docs grows, but it's hard 
> to tell how many documents have been deleted.
> Exposing the number of deleted docs via a *Reader.deletedDocs() method lets 
> you get to this number.
> I'm attaching a patch that touches the following:
> M  src/test/org/apache/lucene/index/TestSegmentReader.java
> M  src/java/org/apache/lucene/index/MultiReader.java
> M  src/java/org/apache/lucene/index/IndexReader.java
> M  src/java/org/apache/lucene/index/FilterIndexReader.java
> M  src/java/org/apache/lucene/index/ParallelReader.java
> M  src/java/org/apache/lucene/index/SegmentReader.java
> SegmentReader also got a public static main(String[]) that takes 1 
> command-line parameter, a path to the index to check, and prints out the 
> number of deleted docs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-859) Expose the number of deleted docs in index/segment

2007-04-09 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487635
 ] 

Yonik Seeley commented on LUCENE-859:
-

Isn't this redundant with existing IndexReader methods?

deletedDocs() == maxDoc() - numDocs()
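
In code (only the index path is assumed):

    IndexReader reader = IndexReader.open("/path/to/index");
    int deleted = reader.maxDoc() - reader.numDocs();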

> Expose the number of deleted docs in index/segment
> --
>
> Key: LUCENE-859
> URL: https://issues.apache.org/jira/browse/LUCENE-859
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Otis Gospodnetic
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: LUCENE-859
>
>
> Use case:
> We've got a lot of large, mostly search-only indices. These indices are not 
> re-optimized once "deployed".   Docs in them do not get updated, but they do 
> get deleted.  After a while, the number of deleted docs grows, but it's hard 
> to tell how many documents have been deleted.
> Exposing the number of deleted docs via a *Reader.deletedDocs() method lets 
> you get to this number.
> I'm attaching a patch that touches the following:
> M  src/test/org/apache/lucene/index/TestSegmentReader.java
> M  src/java/org/apache/lucene/index/MultiReader.java
> M  src/java/org/apache/lucene/index/IndexReader.java
> M  src/java/org/apache/lucene/index/FilterIndexReader.java
> M  src/java/org/apache/lucene/index/ParallelReader.java
> M  src/java/org/apache/lucene/index/SegmentReader.java
> SegmentReader also got a public static main(String[]) that takes 1 
> command-line parameter, a path to the index to check, and prints out the 
> number of deleted docs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Large scale sorting

2007-04-09 Thread jian chen

Hi, Doug,

I have been thinking about this as well lately and have some thoughts
similar to Paul's approach.

Lucene has the norm data for each document field. Conceptually it is a byte
array with one byte per document for each field. At query time, I think the
norm array is loaded into memory the first time it is accessed, allowing for
efficient look-up of the norm value for each document.

Now, if we could use integers to represent the sort field values, which is
typically the case for most applications, maybe we can afford to have the
sort field values stored in the disk and do disk lookup for each document
matched? The look up of the sort field value will be as simple as docNo * 4
+ offset.

This way, we use the same approach as constructing the norms (proper merging
for incremental indexing), but, at search time, we don't load the sort field
values into memory; instead, we just keep them on disk.
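
A hedged sketch of that lookup (hypothetical file name and layout: one 4-byte
int per document, written in docNo order, norms-style):

    class DiskSortValues {
      private final RandomAccessFile sortVals;
      DiskSortValues(String path) throws IOException {
        sortVals = new RandomAccessFile(path, "r");
      }
      int sortValue(int docNo) throws IOException {
        sortVals.seek((long) docNo * 4);  // fixed-width record => simple seek
        return sortVals.readInt();
      }
    }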

Will this approach be good enough?

Thanks for your feedback.

Jian


On 4/9/07, Doug Cutting <[EMAIL PROTECTED]> wrote:


Paul Smith wrote:
> Disadvantages to this approach:
> * It's a lot more I/O intensive

I think this would be prohibitive.  Queries matching more than a few
hundred documents will take several seconds to sort, since random disk
accesses are required per matching document.  Such an approach is only
practical if you can guarantee that queries match fewer than a hundred
documents, which is not generally the case, especially with large
collections.

> I'm working on the basis that it's a LOT harder/more expensive to simply
> allocate more heap size to cover the current sorting infrastructure.
> One hits memory limits faster.  Not everyone can afford 64-bit hardware
> with many Gb RAM to allocate to a heap.  It _is_ cheaper/easier to build
> a disk subsystem to tune this I/O approach, and one can still use any
> RAM as buffer cache for the memory-mapped file anyway.

In my experience, raw search time starts to climb towards one second per
query as collections grow to around 10M documents (in round figures and
with lots of assumptions).  Thus, searching on a single CPU is less
practical as collections grow substantially larger than 10M documents,
and distributed solutions are required.  So it would be convenient if
sorting is also practical for ~10M document collections on standard
hardware.  If 10M strings with 20 characters are required in memory for
efficient search, this requires 400MB.  This is a lot, but not an
unusual amount on today's machines.  However, if you have a large number
of fields, then this approach may be problematic and force you to
consider a distributed solution earlier than you might otherwise.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-04-09 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487631
 ] 

Otis Gospodnetic commented on LUCENE-584:
-

Doron: just to address your question from Apr/7 - I expect/hope to see an 
improvement in performance because of this difference:

  hc.collect(doc(), score()); 
  mc.collect(doc()); 

the delta being the cost of the score() call that does the scoring.  If I 
understand things correctly, that means that what Grant described at the bottom 
of http://lucene.apache.org/java/docs/scoring.html will all be skipped.  No 
Scorer, no BooleanScorer(2), no ConjunctionScorer...
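
To make the delta concrete, a minimal sketch of the status quo (real
HitCollector API; searcher, query and reader are assumed to be in scope). The
search loop computes the score even though the collector throws it away:

    final BitSet matches = new BitSet(reader.maxDoc());
    searcher.search(query, new HitCollector() {
      public void collect(int doc, float score) {
        matches.set(doc);   // score was computed, then discarded
      }
    });

With mc.collect(doc()), that scoring work would be skipped.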


> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, 
> Filter-20060628.patch, HitCollector-20060628.patch, 
> IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, 
> Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, 
> Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, 
> TestSortedVIntList.java
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-859) Expose the number of deleted docs in index/segment

2007-04-09 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-859:


Attachment: LUCENE-859

El patcho.


> Expose the number of deleted docs in index/segment
> --
>
> Key: LUCENE-859
> URL: https://issues.apache.org/jira/browse/LUCENE-859
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Otis Gospodnetic
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: LUCENE-859
>
>
> Use case:
> We've got a lot of large, mostly search-only indices. These indices are not 
> re-optimized once "deployed".   Docs in them do not get updated, but they do 
> get deleted.  After a while, the number of deleted docs grows, but it's hard 
> to tell how many documents have been deleted.
> Exposing the number of deleted docs via a *Reader.deletedDocs() method lets 
> you get to this number.
> I'm attaching a patch that touches the following:
> M  src/test/org/apache/lucene/index/TestSegmentReader.java
> M  src/java/org/apache/lucene/index/MultiReader.java
> M  src/java/org/apache/lucene/index/IndexReader.java
> M  src/java/org/apache/lucene/index/FilterIndexReader.java
> M  src/java/org/apache/lucene/index/ParallelReader.java
> M  src/java/org/apache/lucene/index/SegmentReader.java
> SegmentReader also got a public static main(String[]) that takes 1 
> command-line parameter, a path to the index to check, and prints out the 
> number of deleted docs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-859) Expose the number of deleted docs in index/segment

2007-04-09 Thread Otis Gospodnetic (JIRA)
Expose the number of deleted docs in index/segment
--

 Key: LUCENE-859
 URL: https://issues.apache.org/jira/browse/LUCENE-859
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Otis Gospodnetic
 Assigned To: Otis Gospodnetic
Priority: Minor
 Attachments: LUCENE-859

Use case:
We've got a lot of large, mostly search-only indices. These indices are not 
re-optimized once "deployed".   Docs in them do not get updated, but they do 
get deleted.  After a while, the number of deleted docs grows, but it's hard to 
tell how many documents have been deleted.

Exposing the number of deleted docs via a *Reader.deletedDocs() method lets you 
get to this number.

I'm attaching a patch that touches the following:

M  src/test/org/apache/lucene/index/TestSegmentReader.java
M  src/java/org/apache/lucene/index/MultiReader.java
M  src/java/org/apache/lucene/index/IndexReader.java
M  src/java/org/apache/lucene/index/FilterIndexReader.java
M  src/java/org/apache/lucene/index/ParallelReader.java
M  src/java/org/apache/lucene/index/SegmentReader.java

SegmentReader also got a public static main(String[]) that takes 1 command-line 
parameter, a path to the index to check, and prints out the number of deleted 
docs.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-04-09 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487617
 ] 

Doron Cohen commented on LUCENE-848:


Seems okay to me (since it's all in the benchmark).

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> 
>
> Key: LUCENE-848
> URL: https://issues.apache.org/jira/browse/LUCENE-848
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Steven Parkes
> Assigned To: Steven Parkes
>Priority: Minor
> Fix For: 2.2
>
> Attachments: LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-04-09 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487616
 ] 

Doron Cohen commented on LUCENE-584:


> > When you rerun, you may want to use my alg - to compare the two approaches 
> > in one run. 
> This is more dangerous though. 

Agreed. I was trying to get rid of this by splitting each round into three 
steps - gc(), warm(), work() - where warm() and work() are the same, except 
that warm()'s stats are disregarded. Still, switching the order of "by match" 
and "by bits" yields different results. 

Sometimes we would like not to disregard GC - in particular if one approach is 
creating more (or more complex) garbage than another approach. 

Perhaps we should look at two measures: best & avg/sum (the second ignoring 
the first run, for HotSpot). 


> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, 
> Filter-20060628.patch, HitCollector-20060628.patch, 
> IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, 
> Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, 
> Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, 
> TestSortedVIntList.java
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-04-09 Thread Mike Klaas (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487613
 ] 

Mike Klaas commented on LUCENE-584:
---

Instead of discarding the first run, the approach I usually take is to run 3-4 
times and pick the minimum.  You can then run several of these "sets" and 
average over the minimum of each.  GC is still an issue, though.  It is hard 
to get around with a mark & sweep collector (reference counting is much 
friendlier in this regard).
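
In code, the scheme is roughly (doWork() stands in for the benchmarked
operation):

    long best = Long.MAX_VALUE;
    for (int run = 0; run < 4; run++) {
      long t0 = System.currentTimeMillis();
      doWork();                       // hypothetical workload under test
      best = Math.min(best, System.currentTimeMillis() - t0);
    }
    // run several such sets and average their minimums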

> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, 
> Filter-20060628.patch, HitCollector-20060628.patch, 
> IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, 
> Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, 
> Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, 
> TestSortedVIntList.java
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-04-09 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487609
 ] 

Steven Parkes commented on LUCENE-848:
--

That's what I meant (and did).

If it's okay, I'll bundle it into 848. 



> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> 
>
> Key: LUCENE-848
> URL: https://issues.apache.org/jira/browse/LUCENE-848
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Steven Parkes
> Assigned To: Steven Parkes
>Priority: Minor
> Fix For: 2.2
>
> Attachments: LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-04-09 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487608
 ] 

Doron Cohen commented on LUCENE-848:


> Also, I was going to add support to the algorithm format for setting max 
> field length ... 

If this means extending the algorithm language, it would be simpler to just 
base it on a property here - set that property in the alg file - 
"max.field.length=2" - and then read the new property in OpenIndexTask 
(see how the merge.factor property is read) and set it on the index. 
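
A hedged sketch of the two pieces (the Config accessor pattern is copied from
the existing merge.factor handling; exact method names may differ):

    # in the .alg file:
    max.field.length=2000

    // in OpenIndexTask:
    int maxFieldLength = getRunData().getConfig().get(
        "max.field.length", IndexWriter.DEFAULT_MAX_FIELD_LENGTH);
    writer.setMaxFieldLength(maxFieldLength);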


> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> 
>
> Key: LUCENE-848
> URL: https://issues.apache.org/jira/browse/LUCENE-848
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Steven Parkes
> Assigned To: Steven Parkes
>Priority: Minor
> Fix For: 2.2
>
> Attachments: LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-04-09 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487600
 ] 

Steven Parkes commented on LUCENE-848:
--

By the way, that's a rough patch. I'm cleaning it up as I use it to test 847.

Also, I was going to add support to the algorithm format for setting max field 
length ...

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> 
>
> Key: LUCENE-848
> URL: https://issues.apache.org/jira/browse/LUCENE-848
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Steven Parkes
> Assigned To: Steven Parkes
>Priority: Minor
> Fix For: 2.2
>
> Attachments: LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-09 Thread Andy Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Liu updated LUCENE-855:


Attachment: TestRangeFilterPerformanceComparison.java

Here's my new benchmark.

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, 
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  The performance increase is less 
> dramatic when a field has fewer unique terms or the index has fewer 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values are stored as longs, is that there's no 
> longer the need to make the values lexographically comparable, i.e. padding 
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The 
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step which can take a while to 
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
> can be called explicitly or is automatically called the first time 
> MemoryCachedRangeFilter is applied using a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - Field contains many unique numeric values
> - Index contains large amount of documents
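
A hedged sketch of the bits() logic described above (array names are assumed;
note Arrays.binarySearch may land on any one of several equal values, so a
real implementation would scan to the boundary of the run):

    static BitSet rangeBits(long[] values, int[] docIds, int maxDoc,
                            long lower, long upper) {
      int lo = Arrays.binarySearch(values, lower);
      if (lo < 0) lo = -lo - 1;        // first index with values[i] >= lower
      int hi = Arrays.binarySearch(values, upper);
      if (hi < 0) hi = -hi - 2;        // last index with values[i] <= upper
      BitSet bits = new BitSet(maxDoc);
      for (int i = lo; i <= hi; i++) {
        bits.set(docIds[i]);           // mark every doc inside the range
      }
      return bits;
    }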

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-09 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487595
 ] 

Andy Liu commented on LUCENE-855:
-

In your updated benchmark, you're combining the range filter with a term query 
that matches one document.  I don't believe that's the typical use case for a 
range filter.  Usually the user employs a range to filter a large document set. 
 

I created a different benchmark to compare standard range filter, 
MemoryCachedRangeFilter, and Matt's FieldCacheRangeFilter using 
MatchAllDocsQuery, ConstantScoreQuery, and TermQuery (matching one doc like the 
last benchmark).  Here are the results:

Reader opened with 10 documents.  Creating RangeFilters...
Filter                    Query                 Bits (ms)   Search (ms)
------                    -----                 ---------   -----------
RangeFilter               MatchAllDocsQuery     4421        5285
RangeFilter               ConstantScoreQuery    4200        8694
RangeFilter               TermQuery             4088        4133
MemoryCachedRangeFilter   MatchAllDocsQuery     80          1142
MemoryCachedRangeFilter   ConstantScoreQuery    79          482
MemoryCachedRangeFilter   TermQuery             73          95
FieldCacheRangeFilter     MatchAllDocsQuery     0           1146
FieldCacheRangeFilter     ConstantScoreQuery    1           356
FieldCacheRangeFilter     TermQuery             0           19

Here are some points:

1. When searching with a filter, bits() is called, so the search time includes
the bits() time.
2. Matt's FieldCacheRangeFilter is faster for ConstantScoreQuery, although not
by much.  Using MatchAllDocsQuery, they run neck-and-neck.  FCRF is much faster
for TermQuery, since MCRF has to create the BitSet for the range before the
search is executed.
3. I get fewer document hits when running FieldCacheRangeFilter with
ConstantScoreQuery.  Matt, there may be a bug in getNextSetBit().  Not sure if
this would affect the benchmark.
4. I'd be interested to see performance numbers when FieldCacheRangeFilter is
used with ChainedFilter.  I suspect that MCRF would be faster in this case,
since I'm assuming that FCRF has to reconstruct a standard BitSet during
clone().

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch

[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-04-09 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487594
 ] 

Yonik Seeley commented on LUCENE-584:
-

> When you rerun, you may want to use my alg - to compare the two approaches in 
> one run.

This is more dangerous, though.  GC from one method's garbage can penalize the
2nd method's performance.
Also, hotspot effects are hard to account for: if method1 and method2 use
common methods, method2 will often execute faster than method1 because more
optimization has already been done on those common methods.

The hotspot effect can be minimized by running the test multiple times in the 
same JVM instance and discarding the first runs, but it's not so easy for GC.
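
A minimal harness along those lines (entirely illustrative, not from any patch
here): each method gets its own JVM invocation so that one method's garbage
cannot penalize the other, and the first in-process iterations are discarded
as hotspot warm-up:

{code}
// Run as: java BenchHarness method1   (then separately: java BenchHarness method2)
public class BenchHarness {

  static final int RUNS = 10;    // iterations in this JVM
  static final int DISCARD = 3;  // warm-up iterations to throw away

  public static void main(String[] args) {
    Runnable target = "method1".equals(args[0]) ? method1() : method2();
    long best = Long.MAX_VALUE;
    for (int i = 0; i < RUNS; i++) {
      long start = System.currentTimeMillis();
      target.run();
      long elapsed = System.currentTimeMillis() - start;
      if (i >= DISCARD) {        // ignore runs dominated by hotspot compilation
        best = Math.min(best, elapsed);
      }
    }
    System.out.println(args[0] + ": " + best + " ms");
  }

  // Placeholders for the two implementations being compared.
  static Runnable method1() { return new Runnable() { public void run() { /* bench body */ } }; }
  static Runnable method2() { return new Runnable() { public void run() { /* bench body */ } }; }
}
{code}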

> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, 
> Filter-20060628.patch, HitCollector-20060628.patch, 
> IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, 
> Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, 
> Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, 
> TestSortedVIntList.java
>
>
> {code}
> package org.apache.lucene.search;
>
> import java.io.IOException;
> import org.apache.lucene.index.IndexReader;
>
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
>
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with a smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.
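
To make the proposal concrete, here is a minimal sketch of one alternative
implementation of the proposed interface, backed by a sorted array of set-bit
indices - compact in exactly the sparse, permissions-style case described
above (the attached SortedVIntList is a more serious take on the same idea):

{code}
import java.util.Arrays;

// Sketch: a sparse AbstractBitSet holding only the indices of set bits.
// Memory is proportional to the number of set bits, not to maxDoc.
public class SparseBitSet implements AbstractBitSet {
  private final int[] setBits;  // sorted ascending, no duplicates

  public SparseBitSet(int[] sortedSetBits) {
    this.setBits = sortedSetBits;
  }

  public boolean get(int index) {
    return Arrays.binarySearch(setBits, index) >= 0;
  }
}
{code}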

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-04-09 Thread Steven Parkes (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Parkes updated LUCENE-848:
-

Attachment: LUCENE-848.txt

This patch is a first cut at Wikipedia benchmark support. It downloads the 
current English pages from the Wikipedia download site ... which, of course, is 
actually not there right now. I'm not quite sure what's up, but you can find 
the files at http://download.wikimedia.org/enwiki/20070402/ right now if you 
want to play.

It adds ExtractWikipedia.java, which uses Xerces-J to grab the individual 
articles. It writes the articles in the same format as the Reuters stuff, so a 
genericised ReutersDocMaker (DirDocMaker) works.

The current size of the download file is 2.1G bzip2'd. It's supposed to contain 
about 1.2M documents, but I came out with 2 or 3, I think, so there may be 
"extra" files in there. (Some entries are links and I tried to get rid of 
those, but I may have missed a particular coding or case.)

For the first pass, I copied the Reuters steps of decompressing and parsing. 
This creates big temporary files. Moreover, it creates a big directory tree in 
the end. (The extractor uses a fixed number of documents per directory and 
grows the depth of the tree logarithmically, a lot like Lucene segments.)

It's not clear how this preprocessing-to-a-directory-tree compares to on-the-fly 
decompression, which would require fewer disk seeks on the input during 
indexing. May try that at some point ...
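
The core of such an extractor is small.  Below is a hedged sketch of the idea,
not the attached ExtractWikipedia.java (it uses JAXP's SAX API rather than
Xerces-J directly, and save() is a placeholder for the Reuters-style layout):

{code}
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: stream pages-articles.xml and emit one document per article.
public class ExtractWikipediaSketch extends DefaultHandler {
  private final StringBuffer buf = new StringBuffer();
  private String title;

  public void startElement(String uri, String local, String qname, Attributes atts) {
    buf.setLength(0);  // naive: assumes <title> and <text> hold plain character data
  }

  public void characters(char[] ch, int start, int len) {
    buf.append(ch, start, len);
  }

  public void endElement(String uri, String local, String qname) throws SAXException {
    if ("title".equals(qname)) {
      title = buf.toString();
    } else if ("text".equals(qname)) {
      save(title, buf.toString());  // one output file per article
    }
  }

  private void save(String title, String text) {
    // placeholder: write the article out in the Reuters-style directory layout
  }

  public static void main(String[] args) throws Exception {
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new java.io.File(args[0]), new ExtractWikipediaSketch());
  }
}
{code}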

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> 
>
> Key: LUCENE-848
> URL: https://issues.apache.org/jira/browse/LUCENE-848
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/benchmark
>Reporter: Steven Parkes
> Assigned To: Steven Parkes
>Priority: Minor
> Fix For: 2.2
>
> Attachments: LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Assigned: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-09 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic reassigned LUCENE-855:
---

Assignee: Otis Gospodnetic

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-09 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487590
 ] 

Otis Gospodnetic commented on LUCENE-855:
-

Comments about the patch so far:

Cosmetics:
- You don't want to refer to Andy's class in the javadocs, as that class won't 
go in unless Andy makes it faster.
- I see some incorrect (copy/paste error) javadocs, and javadocs/comments with 
typos, in both the test classes and the non-test classes.
- Please configure your Lucene project in Eclipse to use 2 spaces instead of 4. 
In general, once you get the code formatting settings right, it's good 
practice to format your code with that setting before submitting a patch.

Testing:
- You can put the testPerformance() code from 
TestFieldCacheRangeFilterPerformance in the other unit test class you have 
there.
- Your testPerformance() doesn't actually assert...() anything, it just prints 
numbers to stdout.  You can keep the printing, but it would be better to also 
do some asserts, so we can always verify that the FieldCacheRangeFilter beats 
the vanilla RangeFilter without looking at stdout (see the sketch below).
- You may want to close that searcher in testPerformance() before opening a new 
one.  Probably won't make any difference, but still.
- You may also want to close the searcher at the end of the method.
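
A sketch of the kind of assert I mean (the time() helper and the filter/query
fields are placeholders for whatever the test already builds):

{code}
// Sketch for testPerformance(): keep the printing, but also fail the test
// if the cached filter does not actually beat the vanilla RangeFilter.
public void testPerformance() throws Exception {
  long vanillaMs = time(vanillaRangeFilter);
  long cachedMs = time(fieldCacheRangeFilter);
  System.out.println("vanilla=" + vanillaMs + "ms, cached=" + cachedMs + "ms");
  assertTrue("FieldCacheRangeFilter (" + cachedMs + "ms) should beat RangeFilter ("
      + vanillaMs + "ms)", cachedMs < vanillaMs);
}

private long time(Filter filter) throws Exception {
  long start = System.currentTimeMillis();
  searcher.search(query, filter);  // searcher and query are the test's fixtures
  return System.currentTimeMillis() - start;
}
{code}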


Impl:
- In the inner FieldCacheBitSet class, I see:

+    public boolean intersects(BitSet set) {
+      for (int i = 0; i < length; i++) {
+        if (get(i) && set.get(i)) {
+          return true;
+        }
+      }
+      return false;
+    }

Is there room for a small optimization?  What if the BitSets are not of equal 
size?  Wouldn't it make sense to loop through the smaller BitSet then?  Sorry 
if I'm off, I hardly ever work with BitSets.
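
Concretely, something like the following is what I have in mind.  Since
java.util.BitSet.length() is the highest set bit plus one, nothing past the
shorter of the two lengths can possibly intersect:

{code}
public boolean intersects(BitSet set) {
  // Bits beyond either bitset's logical length are all false, so scanning
  // up to the shorter length is sufficient.
  int limit = Math.min(length, set.length());
  for (int i = 0; i < limit; i++) {
    if (get(i) && set.get(i)) {
      return true;
    }
  }
  return false;
}
{code}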

- I see you made the *_PARSERs in FCImpl public (they were private).  Is that 
really needed?  Would package-protected be enough?

- Make sure the ASL header is in all test and non-test classes; I don't see it 
there now.


Overall, I like it - slick and elegant usage of FC!

I'd love to know what Hoss and other big Filter users think about this.  Solr 
makes a lot of use of (Range?)Filters, I believe.


> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch

-- 
This message is automatically 

Re: optimize() method call

2007-04-09 Thread Doug Cutting

Otis Gospodnetic wrote:

I'd advise against calling optimize() at all in an environment whose indices 
are constantly updated.


+1

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Large scale sorting

2007-04-09 Thread Doug Cutting

Paul Smith wrote:

Disadvantages to this approach:
* It's a lot more I/O intensive


I think this would be prohibitive.  Queries matching more than a few 
hundred documents will take several seconds to sort, since random disk 
accesses are required per matching document.  Such an approach is only 
practical if you can guarantee that queries match fewer than a hundred 
documents, which is not generally the case, especially with large 
collections.


I'm working on the basis that it's a LOT harder/more expensive to simply 
allocate more heap size to cover the current sorting infrastructure. 
One hits memory limits faster.  Not everyone can afford 64-bit hardware 
with many GB of RAM to allocate to a heap.  It _is_ cheaper/easier to build 
a disk subsystem to tune this I/O approach, and one can still use any 
RAM as buffer cache for the memory-mapped file anyway.


In my experience, raw search time starts to climb towards one second per 
query as collections grow to around 10M documents (in round figures and 
with lots of assumptions).  Thus, searching on a single CPU is less 
practical as collections grow substantially larger than 10M documents, 
and distributed solutions are required.  So it would be convenient if 
sorting were also practical for ~10M document collections on standard 
hardware.  If 10M strings of 20 characters are required in memory for 
efficient search, this requires 400MB (10M x 20 chars x 2 bytes per Java 
char).  This is a lot, but not an unusual amount on today's machines.  
However, if you have a large number of fields, then this approach may be 
problematic and force you to consider a distributed solution earlier than 
you might otherwise.


Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Progressive Query Relaxation

2007-04-09 Thread J. Delgado

The idea is to efficiently get the desired result set (top N) at once,
without having to re-run different queries inside the application
logic.  Query relaxation avoids several round trips, and could possibly
be offered with and without deduplication.  Maybe this is a feature
required for Solr rather than for Lucene (a rough sketch follows below).
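
Roughly this, as application-side logic made concrete (a minimal sketch
against the Lucene 2.x Hits API; the relaxation steps themselves - exact
phrase, then AND, then OR, then fuzzy - are supplied by the caller):

{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

// Sketch: run progressively relaxed queries until topN results are collected,
// de-duplicating across steps by document id.
public class ProgressiveRelaxation {
  public static List search(Searcher searcher, Query[] relaxationSteps, int topN)
      throws Exception {
    List results = new ArrayList();
    Set seen = new HashSet();
    for (int step = 0; step < relaxationSteps.length && results.size() < topN; step++) {
      Hits hits = searcher.search(relaxationSteps[step]);
      for (int i = 0; i < hits.length() && results.size() < topN; i++) {
        Integer id = new Integer(hits.id(i));
        if (seen.add(id)) {           // skip docs already returned by a stricter step
          results.add(hits.doc(i));   // the matching Document
        }
      }
    }
    return results;
  }
}
{code}

Note this leaves the partial-order question open: results are ordered by
relaxation step first, and by Lucene's per-query ranking within a step.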

Question: even if Lucene's score is not absolute, does it somewhat
determine a partial order among results of different queries?

J.D.

2007/4/9, Otis Gospodnetic <[EMAIL PROTECTED]>:

Not that I know of.  One typically puts that in application logic and re-runs or offers 
to run alternative queries.  No de-duping there, unless you do it in your app.  I think 
one problem with the described approach and Lucene would be that Lucene's scores are not 
"absolute".

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: J. Delgado <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org; solr-dev@lucene.apache.org
Sent: Monday, April 9, 2007 3:46:40 AM
Subject: Progressive Query Relaxation

Has anyone within the Lucene or Solr community attempted to code a
progressive query relaxation technique similar to the one described
here for Oracle Text?
http://www.oracle.com/technology/products/text/htdocs/prog_relax.html

Thanks,

-- J.D.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-09 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487587
 ] 

Otis Gospodnetic commented on LUCENE-855:
-

OK.  I'll wait for the new performance numbers before committing.  Andy, if you 
see anything funky in Matt's patch or if you managed to make your version 
faster, let us know, please.


> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-853) Caching does not work when using RMI

2007-04-09 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved LUCENE-853.
-

   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Committed, thanks Matt.

> Caching does not work when using RMI
> 
>
> Key: LUCENE-853
> URL: https://issues.apache.org/jira/browse/LUCENE-853
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 2.1
> Environment: All 
>Reporter: Matt Ericson
>Priority: Minor
> Attachments: RemoteCachingWrapperFilter.patch, 
> RemoteCachingWrapperFilter.patch, RemoteCachingWrapperFilter.patch, 
> RemoteCachingWrapperFilter.patch .patch
>
>
> Filters and caching use transient maps, so caching does not work if you 
> are using RMI and a remote searcher.
> I want to add a new RemoteCachingWrapperFilter that will make sure that the 
> caching is done on the remote searcher side (see the sketch below).
> 
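
The shape of the fix, very roughly (a hedged sketch, not the attached patch):
the wrapper travels over RMI, and the cache is built on the remote side, where
the IndexReader actually lives.  The static map here is a placeholder for
whatever server-side cache management the real patch uses, and it assumes the
wrapped filter has sensible equals()/hashCode():

{code}
import java.io.IOException;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;

// Sketch: bits() executes on the remote searcher, so the CachingWrapperFilter
// (and its non-transient cache) lives in the server JVM.
public class RemoteCachingWrapperFilterSketch extends Filter {
  private static final Map CACHE = new HashMap();  // server-side, survives calls
  private final Filter filter;

  public RemoteCachingWrapperFilterSketch(Filter filter) {
    this.filter = filter;
  }

  public BitSet bits(IndexReader reader) throws IOException {
    Filter cached;
    synchronized (CACHE) {
      cached = (Filter) CACHE.get(filter);
      if (cached == null) {
        cached = new CachingWrapperFilter(filter);
        CACHE.put(filter, cached);
      }
    }
    return cached.bits(reader);
  }
}
{code}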

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-09 Thread Matt Ericson (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Ericson updated LUCENE-855:


Attachment: FieldCacheRangeFilter.patch

This version will create a real BitSet when cloned, which allows ChainedFilter 
to work correctly (roughly as sketched below).
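
i.e. roughly this shape (illustrative, not the exact patch): clone()
materializes a plain java.util.BitSet so ChainedFilter can mutate it freely:

{code}
// Sketch of the clone() behavior in the FieldCacheBitSet inner class:
// produce a real, mutable java.util.BitSet equivalent of this read-only view.
public Object clone() {
  BitSet real = new BitSet(length);
  for (int i = 0; i < length; i++) {
    if (get(i)) {
      real.set(i);
    }
  }
  return real;
}
{code}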



> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Progressive Query Relaxation

2007-04-09 Thread Otis Gospodnetic
Not that I know of.  One typically puts that in application logic and re-runs 
or offers to run alternative queries.  No de-duping there, unless you do it in 
your app.  I think one problem with the described approach and Lucene would be 
that Lucene's scores are not "absolute".

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: J. Delgado <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org; solr-dev@lucene.apache.org
Sent: Monday, April 9, 2007 3:46:40 AM
Subject: Progressive Query Relaxation

Has anyone within the Lucene or Solr community attempted to code a
progressive query relaxation technique similar to the one described
here for Oracle Text?
http://www.oracle.com/technology/products/text/htdocs/prog_relax.html

Thanks,

-- J.D.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-858) link from Lucene web page to API docs

2007-04-09 Thread Daniel Naber (JIRA)
link from Lucene web page to API docs
-

 Key: LUCENE-858
 URL: https://issues.apache.org/jira/browse/LUCENE-858
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Daniel Naber
 Assigned To: Grant Ingersoll


There should be a way to link from e.g. 
http://lucene.apache.org/java/docs/gettingstarted.html to the API docs, not 
just to the start page with the frameset but to a specific page, e.g. this:

http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/overview-summary.html#overview_description

To make this work, a way to set a relative link is needed.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: linking the API docs

2007-04-09 Thread Grant Ingersoll

Hi Daniel,

Can you file this as an issue and assign it to me?  Nigel and I are  
working through a few things w/ Hudson and the docs, still.  The gist  
of it is that the API and website will be put back on people.a.o.   
This will mean that a relative link like 
api/overview-summary.html#overview_description should be sufficient.


Thanks,
Grant

On Apr 7, 2007, at 4:01 PM, Daniel Naber wrote:


On Saturday 07 April 2007 00:42, Chris Hostetter wrote:

: I think you can put in the link, just use relative link like in the
: site.xml.

using a relative link is *key* ... it ensures not only that the static 
files built by the nightly build work, but also that the docs 
distributed with each release contain good local pointers.


I'm not familiar with forrest; could you help me set up the link?

The pages to be linked are these:
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/overview-summary.html#overview_description
http://lucene.apache.org/java/2_1_0/api/overview-summary.html#overview_description

(etc)

Note that this is not the API docs page (which contains the frameset) but a 
content page plus an anchor.  So I cannot use <link href="ext:javadocs">, but 
a <link> with the anchor appended doesn't work either.

Regards
 Daniel

--
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Progressive Query Relaxation

2007-04-09 Thread J. Delgado

Has anyone within the Lucene or Solr community attempted to code a
progressive query relaxation technique similar to the one described
here for Oracle Text?
http://www.oracle.com/technology/products/text/htdocs/prog_relax.html

Thanks,

-- J.D.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]