Re: Highlighter API

2005-02-18 Thread markharw00d
>>the Highlighter's getBestFragment method takes a TokenStream and a text. Wouldn't it be easier to give it just the text and an analyzer? That's how it was originally coded. The move to TokenStream was a deliberate choice, made in order to decouple the highlighter from the source of tokens and ena

Re: what if the IndexReader crashes, after delete, before close.

2005-01-11 Thread markharw00d
+1 For me it's been a long time since a project mandated JDK 1.3 and even then it was a WebSphere app which wasn't using Lucene. As for JDK 1.4, wasn't there talk of some potential benefit to be had in the new NIO classes too? Doug Cutting wrote: Sigh. This stuff would get a lot simpler if we

Re: auto-filters?

2005-01-03 Thread markharw00d
It looks like Lucene does not use any of the BitSet boolean logic operators (and, or, etc.) - it just seems to use the "get" method to test set membership for individual docs. If this is true the DocIdSet would look like this: public interface DocIdSet { public abstract boolean contains(i
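
A minimal sketch of the membership-only idea in the message above, not the code that was actually proposed. The snippet is cut off after `contains(i`, so the `int docId` parameter and both implementation classes (`BitSetDocIdSet`, `RangeDocIdSet`) are assumptions made for illustration.

```java
import java.util.BitSet;

// Hypothetical membership-only view of a set of document ids.
// The parameter name and type are assumed; the original snippet is truncated.
interface DocIdSet {
    boolean contains(int docId);
}

// One possible implementation, backed by java.util.BitSet.
final class BitSetDocIdSet implements DocIdSet {
    private final BitSet bits;

    BitSetDocIdSet(BitSet bits) {
        // Defensive copy so later changes to the caller's BitSet do not leak in.
        this.bits = (BitSet) bits.clone();
    }

    @Override
    public boolean contains(int docId) {
        // BitSet.get throws on negative indexes, so guard against them here.
        return docId >= 0 && bits.get(docId);
    }
}

// Another possible implementation: a contiguous id range with no bit storage.
final class RangeDocIdSet implements DocIdSet {
    private final int from; // inclusive
    private final int to;   // exclusive

    RangeDocIdSet(int from, int to) {
        this.from = from;
        this.to = to;
    }

    @Override
    public boolean contains(int docId) {
        return docId >= from && docId < to;
    }
}
```

Because callers only ever ask about membership, an implementation is free to choose any storage, which is the point of narrowing the interface to a single method.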

Re: auto-filters?

2005-01-03 Thread markharw00d
Sounds like a good idea. It avoids issues for the novice users (who haven't explicitly constructed filters) and simplifies the code of experienced users who take the trouble to create filters manually. If we intend to make more use of filters this may be an appropriate time to raise a general q

Fuzzy scoring changes available

2004-12-28 Thread markharw00d
Here's the first cut at the changes to fuzzy scoring: http://www.inperspective.com/lucene/LuceneNewFuzzyScoring.zip Paul, I haven't implemented the "tf" suggestions you made, I'm not sure how this can be done efficiently yet. Even without this, results seem to improve on existing scoring algorit

Re: More fuzzy issues - encouraging bad spelling?

2004-12-23 Thread markharw00d
>>That's quick. Do you have a time shrinking machine there? :) Actually, time's up. It'll be after Christmas before I spend any more time on this now but initial results looked promising so I'll make some code available, probably in the new year. I've got an update to the highlighter to release t

Re: More fuzzy issues - encouraging bad spelling?

2004-12-23 Thread markharw00d
Thanks for the suggestions, Paul. I've just tried a scheme using the max docFreq of the expanded terms as the docFreq shared by all expanded terms in their idf calculations (giving a lower, shared, IDF) and I'm still removing the coordination factor on the BooleanQuery that groups the term queri
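
The shared-docFreq scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the message: the idf formula, 1 + ln(numDocs / (docFreq + 1)), is modelled on Lucene's default Similarity of that period, and the class and method names (`SharedIdf`, `idf`, `sharedIdf`) are hypothetical. Removing the coordination factor is not shown.

```java
// Hypothetical helper: every expanded fuzzy term shares one idf value,
// computed from the largest docFreq among the expanded terms.
final class SharedIdf {
    private SharedIdf() {}

    // Assumed idf formula: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Uses the maximum docFreq, which gives the lowest idf of the group.
    static double sharedIdf(int[] docFreqs, int numDocs) {
        if (docFreqs.length == 0) {
            throw new IllegalArgumentException("at least one expanded term is required");
        }
        int maxDocFreq = 0;
        for (int docFreq : docFreqs) {
            maxDocFreq = Math.max(maxDocFreq, docFreq);
        }
        return idf(maxDocFreq, numDocs);
    }
}
```

With this arrangement a rare misspelling can no longer outscore the common, correctly spelled variant on rarity alone, since both receive the same idf.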

Partial read of document fields

2004-09-10 Thread markharw00d
The "questions on Hits.doc" thread on Lucene-user resurrects the issue of partial loading of fields. In summary: it would be nice to be able to read only the fields you need and I proposed a solution on lucene-user some time ago here: http://marc.theaimsgroup.com/?l=lucene-user&m=10852537682111

Re: highlighting phrases

2004-09-01 Thread markharw00d
Adding support for phrases could be tricky. So far I have deliberately avoided reimplementing specialized highlighting logic for each of the different types of queries, e.g. understanding the nuances of "slop factor" in Phrase queries. I may be wrong but adding specialized support for different que

Re: New site using Lucene - Akamai.com

2004-04-23 Thread markharw00d
Wow. Thomas, can you share any details of who else is using Lucene by virtue of the fact they use Akamai services? It would also be interesting to hear how you manage the distribution of indexes (if you're in a position to share that kind of info!)

Re: Performance of hit highlighting and finding term positions for

2004-04-01 Thread markharw00d
730 msecs is the correct number for 10 * 16k docs with StandardTokenizer! The 11ms per doc figure in my post was for highlighting using a lower-case-filter-only analyzer. 5ms of this figure was the cost of the lower-case-filter-only analyzer. 73 msecs is the cost of JUST StandardTokenizer (n

Re: Proposal: extracting term-level stats from query process

2004-03-23 Thread markharw00d
Here's the first cut of the RAMIndex alternative. I've included a JUnit test and some test data. http://www.inperspective.com/lucene/fastindex.htm There's still more to be done but I would appreciate any feedback at this stage. Cheers Mark

Re: Proposal: extracting term-level stats from query process

2004-03-17 Thread markharw00d
Doug, To save any duplicated effort on your part: I've started work on the RAMDirectory alternative you suggested last week: >> It would be interesting to write an in-memory version of IndexReader and >> IndexWriter >>that don't serialize anything to bytes. My current implementation is benchmar

Re: Proposal: extracting term-level stats from query process

2004-03-11 Thread markharw00d
I just re-ran the same tests but using SimpleAnalyzer (a lowercase filter only). This time round responses were: Tokenizing: 5 ms avg per doc; Highlighting: 11 ms avg per doc; RAM indexing docs: 39 ms avg per doc. RAM indexing still looks to add more than I would like. Having reviewed my previous choi

Re: Proposal: extracting term-level stats from query process

2004-03-11 Thread markharw00d
Thanks for the response, Doug. My working assumption was that whatever analysis was done in evaluating the query would be costly to repeat but from your breakdown of what is actually required it looks like all of my requirements can be met based on calls to IndexReader#docFreq(term) which I would

Proposal: extracting term-level stats from query process

2004-03-11 Thread markharw00d
I think the TermScorer could be used to produce some useful feedback on performance of terms used in queries with the addition of some new methods: int getNumDocMatches(); float getAverageScore(); These could be used in the following scenarios: * selecting which terms to offer spelling correction
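
A rough sketch of the bookkeeping behind the two proposed methods. It assumes the scorer would record one score per matching document. `TermMatchStats` and `record` are hypothetical names and are not part of Lucene's TermScorer; only `getNumDocMatches` and `getAverageScore` come from the proposal.

```java
// Hypothetical accumulator for per-term match statistics.
final class TermMatchStats {
    private int numDocMatches;
    private double scoreSum;

    // Assumed hook: called once for each document the term matches.
    void record(float score) {
        numDocMatches++;
        scoreSum += score;
    }

    // Number of documents matched so far.
    int getNumDocMatches() {
        return numDocMatches;
    }

    // Mean score over matched documents; 0 when nothing has matched.
    float getAverageScore() {
        return numDocMatches == 0 ? 0f : (float) (scoreSum / numDocMatches);
    }
}
```

Keeping a running sum rather than a list of scores means the statistics add only constant memory per term, whatever the number of matches.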

Re: Dmitry's Term Vector stuff, plus some

2004-02-26 Thread markharw00d
>>Another approach that someone mentioned for solving this problem is to create a >>fragment index for long documents. Alternatively, could you use term sequence positions to guess where to start extracting text from the doc? If you have identified the best section of the doc based purely on ide

Re: Dmitry's Term Vector stuff, plus some

2004-02-25 Thread markharw00d
Doug, nice suggestion about capping the highlighter's number of tokens - I'll add that in. Bruce, I've had a quick look at your knowledgebase docs. Can't you split them at index time into multiple smaller docs using the tags as doc boundaries? Each lucene document could then have a field with th

Re: Dmitry's Term Vector stuff, plus some

2004-02-25 Thread markharw00d
I've just run some stats on the overhead of tokenizing text/highlighting. It looks like it's tokenizing that's the main problem and it is CPU bound. I ran three tests, all on the same index/machine: Pentium 3 800MHz, 360MB index, Lucene 1.3 final, JDK 1.4.1, Porter stemmer based analyser. For

Re: Dmitry's Term Vector stuff, plus some

2004-02-24 Thread markharw00d
I'm not sure what applications people have in mind for Term Vector support but I would prefer to have the original text positions (not term sequence positions) stored so I can offer this: 1) Significant terms/phrases identification Like "Gigabits" on gigablast.com - used to offer choices of (uns

Re: Query Term Collector (was: Re: New highlighter package available)

2003-10-05 Thread markharw00d
Here are some very important reasons why getTerms() shouldn't be added as a method to Query: Query objects are seen by Lucene users as reusable objects. E.g. they could be used as routing queries which are run repeatedly to classify incoming documents. They are re-usable across multiple inde

Re: Query Term Collector (was: Re: New highlighter package available)

2003-10-04 Thread markharw00d
With regard to Korfut's TermCollector proposition: I do not like the new requirement for all query classes to implement getTerms(). This is effectively what they are currently required to do in the query.rewrite() method - express their high-level logic in primitive terms. I believe the getTerm

RE : New highlighter package available

2003-10-02 Thread markharw00d
Hi Korfut >>As for Mark works of the highlighter, it is not working with >>release 1.3, due to big changes in the core, query rewrite, termenum, etc The JUnit test that accompanies my code tests all query types just fine running with the version I took from CVS as of 20/9/2003. When you say "

Re: New highlighter package available

2003-10-02 Thread markharw00d
>>From what I remember when doing similar patches to Lucene core, alternative >>way (not adding any new support) requires one to dig deep into implementation >>details of Lucene term and query objects, breaking encapsulation That may well have been the case but not since query.rewrite() was intro

Re: New highlighter package available

2003-09-30 Thread markharw00d
My intention is for this submission to be used however you see fit. If that's in the core or not I don't really mind. What I would like to see however is any non-core projects that are considered useful having an automated mechanism for building and JUnit testing against the latest Lucene release.

RE : New highlighter package available

2003-09-25 Thread markharw00d
Thanks for the feedback on the highlighter package. Here are some responses to the issues raised: >>what may be the performance implications seeing that >>the method query.rewrite(reader) seems to be called twice, one for >>querying, once for highlighting. One solution is to do this before callin