The Highlighter's getBestFragment method takes a TokenStream and a text.
Wouldn't it be easier to give it just the text and an Analyzer?
That's how it was originally coded. The move to TokenStream was a
deliberate choice, made in order to decouple the highlighter from the source of
tokens and ena
+1
For me it's been a long time since a project mandated JDK 1.3, and even
then it was a WebSphere app which wasn't using Lucene.
As for JDK 1.4, wasn't there talk of some potential benefit to be had
in the new NIO classes too?
Doug Cutting wrote:
Sigh. This stuff would get a lot simpler if we
It looks like Lucene does not use any of the BitSet boolean logic
operators (and, or, etc.) - it just seems to use the "get" method to
test set membership for individual docs.
If this is true, the DocIdSet would look like this:
public interface DocIdSet
{
    public abstract boolean contains(int docNum);
}
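A self-contained sketch of that idea (the class name BitSetDocIdSet and the docNum parameter are illustrative, not actual Lucene API):

```java
import java.util.BitSet;

// Hypothetical minimal interface: membership testing is the only
// operation the discussion above suggests Lucene actually needs.
interface DocIdSet {
    boolean contains(int docNum);
}

// One possible implementation, backed by java.util.BitSet.
class BitSetDocIdSet implements DocIdSet {
    private final BitSet bits;

    BitSetDocIdSet(BitSet bits) {
        this.bits = bits;
    }

    public boolean contains(int docNum) {
        return bits.get(docNum);
    }
}
```

A filter built behind this interface need not be a BitSet at all - a sorted int array or a hash set would satisfy the same contract.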
Sounds like a good idea. It avoids issues for the novice users (who
haven't explicitly constructed filters) and simplifies the code of
experienced users who take the trouble to create filters manually.
If we intend to make more use of filters this may be an appropriate time
to raise a general q
Here's the first cut at the changes to fuzzy scoring:
http://www.inperspective.com/lucene/LuceneNewFuzzyScoring.zip
Paul, I haven't implemented the "tf" suggestions you made, I'm not sure
how this can be done efficiently yet. Even without this, results seem
to improve on existing scoring algorit
>>That's quick. Do you have a time shrinking machine there?
:) Actually, time's up. It'll be after Christmas before I spend any more
time on this now but initial results looked promising so I'll make some
code available, probably in the new year.
I've got an update to the highlighter to release t
Thanks for the suggestions, Paul.
I've just tried a scheme using the max docFreq of the expanded terms as
the docFreq shared by all expanded terms in their idf calculations
(giving a lower, shared, IDF) and I'm still removing the coordination
factor on the BooleanQuery that groups the term queri
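A rough sketch of that shared-IDF scheme, assuming the classic idf formula 1 + ln(numDocs / (docFreq + 1)); the class and method names here are made up for illustration:

```java
// Sketch: give every term expanded from a fuzzy query the same IDF,
// derived from the largest (most common) docFreq among the expansions.
class SharedIdf {
    // Classic Lucene-style idf formula (an assumption, for illustration).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    static double sharedIdf(int[] expandedTermDocFreqs, int numDocs) {
        int maxDf = 0;
        for (int df : expandedTermDocFreqs) {
            maxDf = Math.max(maxDf, df);
        }
        // All expanded terms share the IDF of the most frequent one,
        // giving a lower, common weight.
        return idf(maxDf, numDocs);
    }
}
```

Because every expanded term gets the IDF of the most frequent one, a rare misspelling can no longer outscore the common term the user probably meant.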
The "questions on Hits.doc" thread on Lucene-user resurrects the issue of partial
loading of fields.
In summary: it would be nice to be able to read only the fields you need and I
proposed a solution on lucene-user some time ago here:
http://marc.theaimsgroup.com/?l=lucene-user&m=10852537682111
Adding support for phrases could be tricky.
So far I have deliberately avoided reimplementing specialized highlighting logic for
each of the different types of queries, e.g. understanding the nuances of "slop
factor" in phrase queries. I may be wrong, but adding specialized
support for different que
Wow. Thomas, can you share any details of who else is using Lucene by virtue
of the fact they use Akamai services? It would also be interesting to hear how you
manage the distribution of indexes (if you're in a position to share that kind of info!)
--
730 msecs is the correct number for 10 * 16k docs with StandardTokenizer!
The 11ms per doc figure in my post was for highlighting using a
lower-case-filter-only analyzer.
5ms of this figure was the cost of the lower-case-filter-only analyzer.
73 msecs is the cost of JUST StandardTokenizer (n
Here's the first cut of the RAMIndex alternative.
I've included a JUnit test and some test data.
http://www.inperspective.com/lucene/fastindex.htm
There's still more to be done but I would appreciate any feedback at this stage.
Cheers
Mark
---
Doug,
To save any duplicated effort on your part: I've started work on the RAMDirectory
alternative you suggested last week:
>> It would be interesting to write an in-memory version of IndexReader and
>> IndexWriter that don't serialize anything to bytes.
My current implementation is benchmar
I just re-ran the same tests but using SimpleAnalyzer (a lowercase filter only)
This time round responses were:
Tokenizing:5 ms avg per doc
Highlighting:11 ms avg per doc
RAM Indexing docs:39 ms avg per doc
RAM indexing still looks to add more than I would like.
Having reviewed my previous choi
Thanks for the response, Doug
My working assumption was that whatever analysis was done in evaluating the query
would be costly to repeat, but from your breakdown of what is actually required it
looks like all of my requirements can be met based on calls to
IndexReader#docFreq(term), which I would
I think the TermScorer could be used to produce some useful feedback on performance of
terms used in queries with the addition of some new methods:
int getNumDocMatches();
float getAverageScore();
These could be used in the following scenarios:
* selecting which terms to offer spelling correction
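A sketch of the statistics such methods could expose; only the two method names come from the proposal above - the accumulation logic is guesswork:

```java
// Sketch of per-term match statistics a scorer could accumulate.
// record() would be called once per matching document.
class TermStats {
    private int numDocMatches = 0;
    private float totalScore = 0f;

    void record(float score) {
        numDocMatches++;
        totalScore += score;
    }

    int getNumDocMatches() {
        return numDocMatches;
    }

    float getAverageScore() {
        return numDocMatches == 0 ? 0f : totalScore / numDocMatches;
    }
}
```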
>>Another approach that someone mentioned for solving this problem is to create a
>>fragment index for long documents.
Alternatively, could you use term sequence positions to guess where to start
extracting text from the doc?
If you have identified the best section of the doc based purely on ide
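One way to sketch that guess, assuming you have the character offsets of the matched terms: slide a fixed-size window and keep the start that covers the most matches (all names here are illustrative):

```java
import java.util.Arrays;

// Sketch: given character offsets of matched terms, pick the start of
// a fixed-size window that covers the densest cluster of matches.
class FragmentStart {
    static int bestStart(int[] matchOffsets, int windowSize) {
        Arrays.sort(matchOffsets);
        int bestStart = 0, bestCount = 0;
        // Try each match offset as a candidate window start.
        for (int start : matchOffsets) {
            int count = 0;
            for (int off : matchOffsets) {
                if (off >= start && off < start + windowSize) count++;
            }
            if (count > bestCount) {
                bestCount = count;
                bestStart = start;
            }
        }
        return bestStart;
    }
}
```

Extraction could then begin at (or a little before) the returned offset rather than tokenizing the whole document.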
Doug,
Nice suggestion about capping the highlighter's number of tokens - I'll add that in.
Bruce,
I've had a quick look at your knowledgebase docs. Can't you split them at index time
into multiple smaller docs, using the tags as doc boundaries?
Each Lucene document could then have a field with th
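A sketch of that index-time split; the "<entry>" tag is an invented example of a section boundary:

```java
// Sketch: split one long knowledgebase document into per-section
// strings on a tag boundary, so each section can be indexed as its
// own (smaller) Lucene document.
class DocSplitter {
    static String[] split(String wholeDoc) {
        String[] parts = wholeDoc.split("<entry>");
        // Drop any empty leading fragment before the first tag.
        int start = parts.length > 0 && parts[0].trim().isEmpty() ? 1 : 0;
        String[] sections = new String[parts.length - start];
        for (int i = start; i < parts.length; i++) {
            sections[i - start] = parts[i].trim();
        }
        return sections;
    }
}
```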
I've just run some stats on the overhead of tokenizing text/highlighting.
It looks like it's the tokenizing that's the main problem, and it is CPU bound.
I ran three tests, all on the same index/machine: Pentium 3 800MHz, 360MB index,
Lucene 1.3 final, JDK 1.4.1, Porter-stemmer-based analyser.
For
I'm not sure what applications people have in mind for Term Vector support but I
would prefer to have the original text positions (not term sequence positions) stored
so I can offer this:
1) Significant terms/phrases identification
Like "Gigabits" on gigablast.com - used to offer choices of (uns
Here are some very important reasons why getTerms() shouldn't be added as a method to
Query:
Query objects are seen by Lucene users as reusable objects.
E.g. they could be used as routing queries which are run repeatedly to classify incoming
documents.
They are re-usable across multiple inde
With regards to Korfut's TermCollector proposition:
I do not like the new requirement for all query classes to implement getTerms(). This
is effectively what they are currently
required to do in the query.rewrite() method - express their high-level logic in
primitive terms.
I believe the getTerm
Hi Korfut
>>As for Mark's work on the highlighter, it is not working with
>>release 1.3, due to big changes in the core, query rewrite, TermEnum, etc.
The JUnit test that accompanies my code tests all query types just fine running with
the version I took from CVS as of 20/9/2003.
When you say "
>>From what I remember when doing similar patches to Lucene core, alternative
>>way (not adding any new support) requires one to dig deep into implementation
>>details of Lucene term and query objects, breaking encapsulation
That may well have been the case, but not since query.rewrite() was intro
My intention is for this submission to be used however you see fit.
Whether that's in the core or not, I don't really mind.
What I would like to see, however, is any non-core projects that are considered useful
having an automated mechanism for building and JUnit testing against the latest
Lucene release.
Thanks for the feedback on the highlighter package.
Here are some responses to the issues raised:
>>what may be the performance implications seeing that
>>the method query.rewrite(reader) seems to be called twice, one for
>>querying, once for highlighting.
One solution is to do this before callin