SpanNotQuery.hashCode cut/paste error?

2006-05-16 Thread Chris Hostetter

SpanNotQuery's hashCode method makes two references to include.hashCode(),
but none to exclude.hashCode() ... this is a mistake, yes/no?
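For illustration, a hedged sketch of what a corrected method might look like, using a stand-in class rather than the real SpanNotQuery (the field names are assumed from the query's include/exclude clauses, and the exact mixing function is a guess, not Lucene's):

```java
// Minimal stand-in (NOT the actual Lucene class): shows a hashCode that
// combines BOTH clauses, where the buggy version used include.hashCode()
// twice. Field names mirror SpanNotQuery's include/exclude by assumption.
public class SpanNotHash {
    private final Object include;
    private final Object exclude;

    public SpanNotHash(Object include, Object exclude) {
        this.include = include;
        this.exclude = exclude;
    }

    @Override
    public int hashCode() {
        int h = include.hashCode();
        h = (h << 1) | (h >>> 31); // rotate left so the combination is order-sensitive
        h ^= exclude.hashCode();   // the fix: use exclude, not include again
        return h;
    }
}
```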



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-569) NearSpans skipTo bug

2006-05-16 Thread paul.elschot (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-569?page=comments#action_12411904 ] 

paul.elschot commented on LUCENE-569:
-

> I tried to make sense of the existing NearSpans implementation over the 
> weekend ... i did not succeed.
> I still haven't had a chance to look at the new one in LUCENE-413 but i want 
> to clarify something you said.. 

For the unordered case, the priority queue implementation over the subspans in 
the current NearSpans is fine.
For the ordered case, I could not figure out how to deal with the priority 
queue and the restriction on ordering at the same time. This is precisely what 
the bug above shows.
 
> >>> The NearSpansOrdered there differs from the current version in that it 
> >>> does not 
> >>> match overlapping subspans, but it passes all current test cases 
> >>> including TestNearSpans here. 
>
> ...should I understand you to mean then that the current implementation of 
> NearSpans does work
> correctly with overlapping sub-spans ... there just isn't a test for it? 

For ordered queries, it might work with overlapping sub-spans in some cases.
However, I'd expect any test to run into the bug above for some other ordered 
cases.
 
> that seems like important enough behavior that we wouldn't want to break it 
> to fix this bug. 

Given the bug, I hope nothing depends on it.

> Even if matching on overlapping subspans wasn't an intentional feature of 
> NearSpans -- the fact that it
> currently works and the documentation is silent on the issue suggests to me 
> that it should remain supported. 

That can probably be done by modifying the NearSpansOrdered of LUCENE-413 at 
lines 133-138 and at line 167, where the end of the previous (possibly 
matching) subspans is compared to the start of the next one.
This could compare the start with the start instead.
I don't know precisely what the intended behaviour is, so I can't say whether 
these changed comparisons should allow equality or not. Perhaps the ends 
should be compared when the starts are equal, just as is done in the priority 
queue for the unordered case.
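The two candidate ordering checks can be sketched as plain predicates over (start, end) offsets; this is a hedged illustration of the alternatives Paul describes, not the actual NearSpansOrdered code:

```java
// Hedged sketch of the two ordering checks discussed above; spans here are
// plain (start, end) offset pairs, not Lucene's Spans objects.
public class OrderedSpanChecks {
    // LUCENE-413 style: the next subspan must begin at or after the end of
    // the previous one, so overlapping subspans never match
    static boolean orderedNonOverlapping(int prevEnd, int nextStart) {
        return prevEnd <= nextStart;
    }

    // "compare the start with the start instead": overlapping subspans may
    // match as long as they start in order
    static boolean orderedAllowOverlap(int prevStart, int nextStart) {
        return prevStart < nextStart;
    }
}
```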



> NearSpans skipTo bug
> 
>
>  Key: LUCENE-569
>  URL: http://issues.apache.org/jira/browse/LUCENE-569
>  Project: Lucene - Java
> Type: Bug

>   Components: Search
> Reporter: Hoss Man
>  Attachments: TestNearSpans.java
>
> NearSpans appears to have a bug in skipTo that causes it to skip over some 
> matching documents completely.  I discovered this bug while investigating 
> problems with SpanWeight.explain, but as far as I can tell the Bug is not 
> specific to Explanations ... it seems like it could potentially result in 
> incorrect matching in some situations where a SpanNearQuery is nested in 
> another query such that skipTo will be used ... I tried to create a high level 
> test case to exploit the bug when searching, but i could not.  TestCase 
> exploiting the class using NearSpan and SpanScorer will follow...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira





Re: OpenBitSet

2006-05-16 Thread eks dev

>Weird... I'm not sure how that could be.  Are you sure you didn't get
>the numbers reversed?

that is exactly what happened; sorry for the wrong numbers. Now it looks as it 
should:

java -version
Java(TM) SE Runtime Environment (build 1.6.0-beta2-b83)
Java HotSpot(TM) Client VM (build 1.6.0-beta2-b83, mixed mode, sharing)

java -server -Xbatch org.apache.solr.util.BitSetPerf 100 50 1 union 
3000 bit
ret=0
TIME=21966

java -server -Xbatch org.apache.solr.util.BitSetPerf 100 50 1 union 
3000 open
ret=0
TIME=19832

I also measured at different densities, and it looks about the same. When I 
find a few spare minutes I will make one PerfTest that generates gnuplot 
diagrams. It would be interesting to see how all the key methods behave as a 
function of density/size. 
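A minimal sketch of such a density sweep over java.util.BitSet (sizes, seeds, and iteration counts are placeholders, and this times only the plain BitSet, not OpenBitSet):

```java
import java.util.BitSet;
import java.util.Random;

// Hedged sketch of the density sweep mentioned above: times BitSet union
// at several fill densities and prints one gnuplot-friendly line per run.
// All parameters are made up for illustration.
public class BitSetDensitySweep {
    static BitSet randomSet(int size, double density, Random rnd) {
        BitSet b = new BitSet(size);
        for (int i = 0; i < size; i++) {
            if (rnd.nextDouble() < density) b.set(i);
        }
        return b;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int size = 1 << 20;
        for (double density : new double[] {0.001, 0.01, 0.1, 0.5}) {
            BitSet a = randomSet(size, density, rnd);
            BitSet b = randomSet(size, density, rnd);
            long t0 = System.nanoTime();
            for (int i = 0; i < 100; i++) {
                BitSet u = (BitSet) a.clone();
                u.or(b); // union
            }
            long ns = System.nanoTime() - t0;
            System.out.println(density + " " + ns); // "density time" pairs for gnuplot
        }
    }
}
```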








Re: Jira Convention: Resolved vs Closed

2006-05-16 Thread Erik Hatcher
I've historically treated Closed and Resolved as the same thing and have 
closed resolved issues just to set them to that state.


Erik

On May 15, 2006, at 9:24 PM, Chris Hostetter wrote:



Is there a documented or unspoken policy about the "Resolved" vs "Closed" 
bug statuses?

How/when should a resolved bug be closed?

(In my experience policy has tended towards the person fixing the bug to 
"resolve" it, and the person who opened the bug to "close" it once they've 
verified the fix -- but that's not really possible with the way the Lucene 
Jira project is set up, since anyone can open a bug, but only developers 
can close them)

-Hoss





Re: SpanNotQuery.hashCode cut/paste error?

2006-05-16 Thread Erik Hatcher
Yes, this is a mistake.  I'm happy to fix it, but looks like you have  
other patches in progress.


Erik

On May 16, 2006, at 3:33 AM, Chris Hostetter wrote:



SpanNotQuery's hashCode method makes two references to include.hashCode(),
but none to exclude.hashCode() ... this is a mistake yes/no?



-Hoss





RE: Nio File Caching & Performance Test

2006-05-16 Thread Robert Engels
My tests still hold that the NioFile I submitted is significantly faster
than the standard FSDirectory.

BUT, the memory mapped implementation is significantly faster than NioFile.
I attribute this to the overhead of managing the soft references, and
possible GC interaction.

SO, I would like to use a memory mapped reader, but I encounter OOM errors
when mapping large files, due to running out of address space.

Has anyone found a solution for this? (A 2 gig index is not all that
large...).
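One commonly suggested workaround can be sketched as follows (this is a hedged illustration, not a patch from this thread): map the file as a series of fixed-size windows rather than one MappedByteBuffer, since a single mapping needs that much contiguous address space and is capped at Integer.MAX_VALUE bytes anyway. The class name and chunk size are assumptions:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hedged sketch: read a large file through several smaller mappings so no
// single mapping needs a huge contiguous region of address space.
public class ChunkedMmapReader {
    private final MappedByteBuffer[] chunks;
    private final long chunkSize;

    public ChunkedMmapReader(String path, long chunkSize) throws IOException {
        this.chunkSize = chunkSize;
        RandomAccessFile raf = new RandomAccessFile(path, "r");
        FileChannel ch = raf.getChannel();
        long len = ch.size();
        int n = (int) ((len + chunkSize - 1) / chunkSize);
        chunks = new MappedByteBuffer[n];
        for (int i = 0; i < n; i++) {
            long off = i * chunkSize;
            // each window is at most chunkSize bytes, the last may be shorter
            chunks[i] = ch.map(FileChannel.MapMode.READ_ONLY, off,
                               Math.min(chunkSize, len - off));
        }
    }

    // absolute read: pick the window, then the offset within it
    public byte get(long pos) {
        return chunks[(int) (pos / chunkSize)].get((int) (pos % chunkSize));
    }
}
```

This trades one huge mapping for bookkeeping on every read; on a 32-bit JVM it still consumes address space in total, just not in one contiguous piece.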

-Original Message-
From: Murat Yakici [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 16, 2006 1:55 AM
To: java-dev@lucene.apache.org
Subject: Re: Nio File Caching & Performance Test


Hi,

According to my humble tests, there is no significant improvement either. NIO 
has buffer-creation time costs compared to other buffered I/O. However, a 
testbed would be ideal for benchmarks.

Murat

Doug Cutting wrote:

> Robert Engels wrote:
>
>> The most important statistic is that the reading via the local cache, vs.
>> going to the OS (where the block is cached) is 3x faster (22344 vs.
>> 68578).
>> With random reads, when the block may not be in the OS cache, it is 8x
>> faster (72766 vs. 586391).
>
> [ ... ]
>
>> This test only demonstrates improvements in the low-level IO layer,
>> but one
>> could infer significant performance improvements for common searches
>> and/or
>> document retrievals.
>
>
> That is not an inference I would make.  There should be some
> improvement, but whether it is significant is not clear to me.
>
>> Is there a standard Lucene search performance I could run both with and
>> without the NioFSDirectory to demonstrate real world performance
>> improvements? I have some internal tests that I am collating, but I would
>> rather use a standard test if possible.
>
>
> No, we don't have a standard benchmark suite.  Folks have talked about
> developing one, but I don't think one yet exists.
>
> Report what you have.  Describe the collection, how it is indexed, how
> you've selected queries, and the improvement in average response time.
>
> Doug
>





Re: Nio File Caching & Performance Test

2006-05-16 Thread Doug Cutting

Robert Engels wrote:

SO, I would like to use a memory mapped reader, but I encounter OOM errors
when mapping large files, due to running out of address space.

Has anyone found a solution for this? (A 2 gig index is not all that
large...).


64-bit hardware, OS and JVM solve this nicely.  On 32-bit systems it 
is hard for the OS to allocate the large, contiguous regions of address 
space required to memory map a 2GB index.


Doug




Re: Nio File Caching & Performance Test

2006-05-16 Thread Yonik Seeley

On 5/16/06, Robert Engels <[EMAIL PROTECTED]> wrote:

SO, I would like to use a memory mapped reader, but I encounter OOM errors
when mapping large files, due to running out of address space.


Pretty much all x86 servers sold are 64 bit capable now.
Run a 64 bit OS if you can :-)


Has anyone found a solution for this? (A 2 gig index is not all that
large...).


I guess one could try a hybrid approach... only mmap certain index
files that are critical for performance.


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Re: Jira Convention: Resolved vs Closed

2006-05-16 Thread Doug Cutting

Chris Hostetter wrote:

How/when should a resolved bug be closed?


I close bugs after their "fix version" is released.

The distinction between "resolved" and "closed" is intended for projects 
with a formal QA process.  An engineer fixes a bug and marks it 
"resolved", and then a tester verifies the fix and either closes it or 
re-opens it if it has not been fixed.  In our case, we're all testers.


Released, closed issues should generally not be re-opened.  If there are 
further problems related to an issue after a release is made and the 
issue has been closed, then a new issue should be created.  Why?  If a 
project has a Jira issue for every commit, and includes the issue name 
in the commit message, then Jira's change log can fully document 
the release, including links to subversion diffs, etc.  But re-opening a 
closed bug messes this up.  It's better to add a new bug that links to 
the old, closed bug.


We're trying to operate this way on Hadoop.  Issues are entered for most 
planned changes and assigned a "fix release".  Then Jira's "road map" 
feature can be used to see what features are planned for various 
upcoming releases.  This isn't perfect, since issues dropped for one 
release are pushed to the next, and the list of issues per release 
becomes unrealistically large (at least for the monthly release schedule 
we're on).  But on Hadoop we currently have dedicated resources who can 
be assigned bugs and will work hard to fix them by a release date.  I'm 
not sure whether this would work on Lucene, which currently lacks such 
dedicated resources, but it might be interesting to try.



(In my experience policy has tended towards the person fixing the bug to
"resolve" it, and the person who opened the bug to "close" once they're
verified the fix -- but that's not really possible with the way the Lucene
Jira project is setup, since anyone can open a bug, but only developers
can close them)


Note that I think it's okay to add folks to the "lucene-developers" 
Jira group who are not Lucene committers.  Some folks are very involved 
with Lucene, but don't submit so many patches that they need to be a 
committer.  For such people it can make sense to have them able to help 
manage Jira issues.


Doug




Phrase IDF and collection frequency !

2006-05-16 Thread ABDOU Samir
Hi,
 
Are there any ideas on how to compute the "document frequency" and "collection 
frequency" of phrases?
 
Document frequency is the number of documents containing the phrase.
 
Collection frequency is the frequency of the phrase in the whole collection.
 
 
Thanks in advance for any help
 
Samir
 
 




FieldsReader synchronized access vs. ThreadLocal ?

2006-05-16 Thread Robert Engels
In SegmentReader, access to FieldsReader.doc(n) is currently
synchronized (which it must be).

Does it not make sense to use a ThreadLocal implementation similar to the
TermInfosReader?

It seems that in a highly multi-threaded server this synchronized method
could lead to significant blocking when the documents are being retrieved?
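The TermInfosReader-style pattern being proposed can be sketched generically; this is an illustration of the idiom only, not Lucene's actual code, and the class and type names are made up:

```java
import java.util.function.Supplier;

// Hedged sketch of the per-thread-clone idiom: each thread lazily gets its
// own stateful reader instance, so the shared doc(n) path needs no
// synchronized block. The Supplier would clone a FieldsReader-like object.
public class PerThread<T> {
    private final ThreadLocal<T> local;

    public PerThread(Supplier<T> cloneFactory) {
        // every thread that calls get() for the first time receives a
        // fresh instance from the factory
        local = ThreadLocal.withInitial(cloneFactory);
    }

    public T get() {
        return local.get();
    }
}
```

The trade-off, as with TermInfosReader, is one open reader (file handle, buffers) per active thread instead of one shared reader behind a lock.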


Re: OpenBitSet

2006-05-16 Thread Chris Hostetter

: I measured also on different densities, and it looks about the same.
: When I find a few spare minutes will make one PerfTest that generates
: gnuplot diagrams. Wold be interesting to see how all key methods behave
: as a function of density/size.

I was thinking the same thing ... i just haven't had time to play with it.

It might also be useful to check how the distribution of the set bits
affects things -- i suspect that for some "Filters" there is some amount of
clustering, as many people index their documents in a particular order and
then filter on ranges of that order (ie: index documents as they are
created, then filter on create date) ... using
Random.nextGaussian() to pick which bits to set might be interesting.
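The nextGaussian() idea might be sketched like this (a hedged illustration; the class name and all parameters are made up):

```java
import java.util.BitSet;
import java.util.Random;

// Hedged sketch: set bits clustered around a center using nextGaussian(),
// to mimic filters over ranges of an insertion-ordered field (e.g. docs
// indexed by create date, then filtered on a date range).
public class ClusteredBits {
    static BitSet gaussianBits(int size, int count, double mean,
                               double stddev, Random rnd) {
        BitSet b = new BitSet(size);
        for (int i = 0; i < count; i++) {
            // draw a bit index from N(mean, stddev), clamped to the set
            int bit = (int) Math.round(mean + stddev * rnd.nextGaussian());
            if (bit >= 0 && bit < size) b.set(bit);
        }
        return b;
    }
}
```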



-Hoss





query question

2006-05-16 Thread Dedian Guo

I am not sure if this is the right place to ask, but could anybody tell me if 
the query syntax can do the select...from...where job as in a traditional 
database? I have checked the Lucene query syntax, but it seems not nearly as 
complex as SQL... correct me if I am wrong, or is there simply no such 
requirement for a search engine?

best,

Dedian


Re: OpenBitSet

2006-05-16 Thread eks dev
Yeah, good hint. We actually made such measurements on a TreeIntegerSet 
implementation, and it is totally astonishing what you get as a result (I 
remember 6 MB against 2 kB memory consumption for "predominantly sorted bit 
vectors" like zip codes, with conjunction/disjunction speed an order of 
magnitude faster, as it walks a shallow tree in that case). If you have any 
possibility to sort your indexes, do so; even Lucene's on-disk representation 
appreciates this, I guess (skips are faster, bit vectors on disk better 
compressed/decompressed?). 
 
We even made one small visualizer of bit vectors that generates an image of 
HitCollector results for any specified query (a gray image where every pixel 
represents 8-32 successive bits from the bit vector; higher density => darker 
color). I like to see the enemy first.  
 
While we are already in this area, just a curiosity: a friend of mine has a 
head-spinning idea to utilize graphics card hardware to do super fast bit 
vector operations. These thingies today are really optimized for basic bit 
operations. I am just curious to see what he comes up with. 
 
I hope I will have some time next week or so to polish some tests for 
OpenBitSet a bit and drop them somewhere on Jira if anybody has an interest in 
playing with them.

A bit off topic: is there anybody doing a ChainedFilter version that uses 
docNrSkipper? As I recall, you wrote the BitSet version :)
 





Re: Nio File Caching & Performance Test

2006-05-16 Thread eks dev
Hi Robert,
I might easily be wrong, but I believe I saw something on JIRA (or was it 
bugzilla?) a long, long time ago, where somebody made an MMAP implementation 
for really big indexes that works on 32 bit. I guess it is worth checking.






java.lang.IndexOutOfBoundsException when querying Lucene

2006-05-16 Thread Alexandru Popescu

Hi!

I have quite a complex query that gets executed against the JCR
content (which uses Lucene for indexing/searching). From time to time I
see this exception:

[trace]
java.lang.IndexOutOfBoundsException: Index: 99, Size: 27
  at java.util.ArrayList.RangeCheck(ArrayList.java:546)
  at java.util.ArrayList.get(ArrayList.java:321)
  at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
  at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:66)
  at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:237)
  at 
org.apache.lucene.index.FilterIndexReader.document(FilterIndexReader.java:103)
  at 
org.apache.lucene.index.FilterIndexReader.document(FilterIndexReader.java:103)
  at 
org.apache.lucene.index.FilterIndexReader.document(FilterIndexReader.java:103)
  at org.apache.lucene.index.MultiReader.document(MultiReader.java:108)
  at 
org.apache.jackrabbit.core.query.lucene.ChildAxisQuery$ChildAxisScorer.calculateChildren(ChildAxisQuery.java:308)
  at 
org.apache.jackrabbit.core.query.lucene.ChildAxisQuery$ChildAxisScorer.next(ChildAxisQuery.java:250)
  at 
org.apache.lucene.search.ConjunctionScorer.init(ConjunctionScorer.java:87)
  at 
org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:44)
  at org.apache.lucene.search.Scorer.score(Scorer.java:37)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:121)
  at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
  at org.apache.lucene.search.Hits.<init>(Hits.java:51)
  at org.apache.lucene.search.Searcher.search(Searcher.java:41)
  at 
org.apache.jackrabbit.core.query.lucene.SearchIndex.executeQuery(SearchIndex.java:374)
  at 
org.apache.jackrabbit.core.query.lucene.QueryImpl.execute(QueryImpl.java:174)
  at org.apache.jackrabbit.core.query.QueryImpl.execute(QueryImpl.java:130)
[/trace]

I really don't have any idea why this is happening. Do you have any
pointers? I would like to understand what may be going wrong so that I can
prevent it in my application (which is based on the Jackrabbit JCR
implementation, and so on Lucene), or at least reliably catch the
exception and know what to do when it occurs.

The ML contains a couple of reported IndexOutOfBoundsException reports
but all of them are about index merging. Same on JIRA.

Any help, ideas, hints are highly appreciated,
thanks very much in advance,

./alex
--
.w( the_mindstorm )p.




RE: Nio File Caching & Performance Test

2006-05-16 Thread Robert Engels
The MMapDirectory works for really big indexes (larger than 2 gig), BUT if
the JVM does not have enough address space (32-bit JVM) it will not work.





non indexed field searching?

2006-05-16 Thread Robert Engels
I know I (and others) have brought this up before, but maybe now with the
lazy field loading (seemingly added because larger documents are being stored)
it is time to revisit.

It seems that maybe a query could be separated into Filter and Query clauses
(similar to how the query optimizer works in Nutch). Clauses that were based
on non-indexed fields would be converted to a Filter.

The problem is that something like

(indexed:somevalue OR nonindexed:somevalue)

would require a complete visit to every document.

But something like

(indexed:somevalue AND nonindexed:somevalue)

would be very efficient.

I understand that this is moving Lucene closer to a database, but it is just
very difficult to perform some complex queries efficiently without it.

*** As an aside, I still don't understand why Filter is not an interface:

interface Filter {
    boolean include(IndexReader reader, int doc) throws IOException;
}

and then you would have

class NonIndexedFilter implements Filter {
    private final String fieldname;
    private final String expression;

    NonIndexedFilter(String fieldname, String expression) {
        this.fieldname = fieldname;
        this.expression = expression;
    }

    public boolean include(IndexReader reader, int doc) throws IOException {
        Document d = reader.document(doc);
        String val = d.get(fieldname);
        return evaluate(expression, val); // evaluate the expression against val
    }
}

Filter being an interface should incur very little overhead in the common
case where it is backed by a BitSet, as a modern JVM will inline the call.


Re: FieldsReader synchronized access vs. ThreadLocal ?

2006-05-16 Thread Doug Cutting

Robert Engels wrote:

It seems that in a highly multi-threaded server this synchronized method
could lead to significant blocking when the documents are being retrieved?


Perhaps, but I'd prefer to wait for someone to demonstrate this as a 
performance bottleneck before adding another ThreadLocal.


Peter Keegan has recently demonstrated pretty good concurrency using 
mmap directory on four and eight CPU systems:


http://www.mail-archive.com/java-user@lucene.apache.org/msg05074.html

Peter also wondered if the SegmentReader.document(int) method might be a 
bottleneck, and tried patching it to run unsynchronized:


http://www.mail-archive.com/java-user@lucene.apache.org/msg05891.html

Unfortunately that did not improve his performance:

http://www.mail-archive.com/java-user@lucene.apache.org/msg06163.html

Doug




Re: non indexed field searching?

2006-05-16 Thread Erik Hatcher


On May 16, 2006, at 3:37 PM, Robert Engels wrote:
It seems that maybe a query could be separated into Filter and Query clauses
(similar to how the query optimizer works in Nutch). Clauses that were based
on non-indexed fields would be converted to a Filter.

The problem is that something like

(indexed:somevalue OR nonindexed:somevalue)

would require a complete visit to every document.


Not necessarily.  A query optimizer could extract these term query clauses, 
look up cached doc sets (bit sets) and union them.  Scoring is the trickier 
part - I'm now curious to dig into Solr to see how it handles this.


I understand that this is moving Lucene closer to a database, but it is just
very difficult to perform some complex queries efficiently without it.


Check out Solr - I think you'll find it fits this niche nicely.

*** As an aside, I still don't understand why Filter is not an interface


I saw that Paul Elschot has just done some refactoring work attached to a 
JIRA issue on this very topic.


Erik





Re: Phrase IDF and collection frequency !

2006-05-16 Thread Tatu Saloranta
--- ABDOU Samir <[EMAIL PROTECTED]> wrote:

> Hi,
>  
> Are there any ideas on how to compute the "document
> frequency" and "collection frequency" of phrases?

Tokenize your input as phrases (instead of words), and
you'll get this the same way you normally get stats
for single-word tokens (Terms)? I did that for bigram
frequency analysis.

Of course, the problem is hardly getting these stats; the problem is finding
what constitutes a phrase. ;-)
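The bigram approach can be sketched without Lucene at all; a hedged, library-free illustration (the class, method, and document representation are made up):

```java
import java.util.List;

// Hedged sketch of the tokenize-as-phrases idea: treat each adjacent word
// pair as a "phrase token" and count its document frequency (docs containing
// it) and collection frequency (total occurrences) directly.
public class PhraseFreq {
    // returns {docFreq, collectionFreq} for the phrase "w1 w2";
    // each document is pre-tokenized into a String[] of words
    static long[] bigramFreq(List<String[]> docs, String w1, String w2) {
        long docFreq = 0, collFreq = 0;
        for (String[] tokens : docs) {
            long inDoc = 0;
            for (int i = 0; i + 1 < tokens.length; i++) {
                if (tokens[i].equals(w1) && tokens[i + 1].equals(w2)) inDoc++;
            }
            if (inDoc > 0) docFreq++; // at least one occurrence in this doc
            collFreq += inDoc;        // every occurrence counts here
        }
        return new long[] {docFreq, collFreq};
    }
}
```

Indexing bigrams as terms gives you exactly these numbers from the term dictionary and postings, which is what the bigram-frequency remark above relies on.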

-+ Tatu +-





Hacking Luke for bytecount-based strings

2006-05-16 Thread Marvin Humphrey

Greets,

There does not seem to be a lot of demand for one implementation of Lucene to 
read indexes generated by another implementation of Lucene for the purposes of 
indexing or searching.  However, there is a demand for index browsing via Luke.

It occurred to me today that if Luke were powered by a version of Lucene with 
my bytecount-based-strings patch applied, it would be able to read indexes 
generated by Ferret.  Ironically, it wouldn't be able to read KinoSearch 
indexes unless I reverted the change which causes the term vectors to be 
stored in the .fdt file.  I'd probably do that.  Luke is great.

One possibility for distributing such a beast is to offer a patched jar for 
download from my website.  Before I start down that road, though, I thought 
I'd bring up the subject here.


Thoughts?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





RE: Hacking Luke for bytecount-based strings

2006-05-16 Thread Robert Engels
While you're at it, why not rewrite Luke in Perl as well...

Seems like a great use of your time.






Re: Hacking Luke for bytecount-based strings

2006-05-16 Thread Paul Elschot
On Wednesday 17 May 2006 06:35, Marvin Humphrey wrote:
> Greets,
> 
> There does not seem to be a lot of demand for one implementation of  
> Lucene to read indexes generated by another implementation of Lucene  
> for the purposes of indexing or searching.  However, there is a  
> demand for index browsing via Luke.
> 
> It occurred to me today that if Luke were powered by a version of  
> Lucene with my bytecount-based-strings patch applied, it would be  
> able to read indexes generated by Ferret.  Ironically, it wouldn't be  
> able to read KinoSearch indexes unless I reverted the change which  
> causes the term vectors to be stored in the .fdt file.  I'd probably  
> do that.  Luke is great.

> 
> One possibility for distributing such a beast is to offer a patched  
> jar for download from my website.  Before I start down that road,  
> though, I thought I'd bring up the subject here.
> 
> Thoughts?

Try invoking Luke with a Lucene jar of your choice on the
classpath before Luke itself:

java -cp lucene-core-1.9-rc1-dev.jar:lukeall.jar org.getopt.luke.Luke

Regards,
Paul Elschot
