But this brings up - has anyone run Lucene off a database trigger or
are triggers known to be slow and bad for this use?
I suspect the tricky bit would be knowing how to balance the calls to
Reader/Writer closes, opens and optimizes.
Record updates are the usual fun and games involving a
Hi Pierre,
Here's the response I gave the last time this question was raised:
The highlighter uses a number of pluggable services, one of which is the
choice of Fragmenter implementation. This interface is for classes which
decide the boundaries where to cut the original text into snippets. The
Nicko Cadell was good enough to point out the issues involved with
generating XHTML compliant markup with the highlighter and provided a
patch to fix it.
The main code has now been updated in the new SVN repository here:
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/
This from the highlighter package will give you the IDF :
WeightedTerm[] QueryTermExtractor.getIdfWeightedTerms(Query query,
IndexReader reader, String fieldName)
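For anyone unsure what an IDF weight expresses, here is a minimal standalone sketch in plain Java. It does not call Lucene at all: the class IdfSketch, its method names and any sample counts are hypothetical, and the formula 1 + ln(numDocs / (docFreq + 1)) is an assumption about what the classic default Similarity computes, so check it against the version you run.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of Lucene or the highlighter package. */
final class IdfSketch {

    /** Assumed classic formula: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    /** Maps each term to its IDF weight, preserving the input order. */
    static Map<String, Double> idfWeights(Map<String, Integer> docFreqs, int numDocs) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            weights.put(e.getKey(), idf(e.getValue(), numDocs));
        }
        return weights;
    }
}
```

Terms that occur in few documents get a higher weight than terms that occur almost everywhere, which is the property a highlighter can use to favour the more distinctive query terms.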
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional
Michael Celona wrote:
Does anyone have a working example of the highlighter class found in the
sandbox?
There are several in the accompanying JUnit test:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/highlighter/src/test/org/apache/lucene/search/highlight/
Cheers
Mark
I posted a suggested solution to this some time ago:
http://marc.theaimsgroup.com/?l=lucene-user&m=108922279803667&w=2
The overhead of doing these tests was negligible but I haven't tried it
since TermVectors and the compound indexes were introduced.
Oscar Picasso wrote:
Hi,
On page 52 of Lucene
Writing this kind of an analyzer can be a bit of a hassle and the
position increment of 0 might affect highlighting code
The highlighter in the sandbox was refactored to support these kinds of
analyzers some time ago so it shouldn't be a problem. The JUnit tests
that come with the highlighter
I don't believe there is an automated build procedure in place for the
contents of the sandbox, consequently there are no Jars or Javadocs
created from the source - you need to do this manually using the
standard Java command line tools or with your IDE.
Bruce Ritchie wrote:
The Highlighter package in CVS has been updated with the following new
features:
Good stuff. Will this work against the 1.4 or only against CVS head?
I think the TokenSources.java requires the latest CVS but is an optional
part of the highlighter package. All other
The documentation for the highlighter already covers how to handle
wildcard queries.
See the javadoc notes on query.rewrite.
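The reason a rewrite is needed: a wildcard or prefix query only holds a pattern, so there are no concrete terms to highlight until the pattern has been expanded against the terms in the index. The sketch below shows that idea in plain Java for the prefix case only. It is not Lucene code; PrefixExpansionSketch and its method are hypothetical names, and in a real application the expansion is what the rewrite call on the query does for you.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

/** Hypothetical illustration only; real expansion is done by rewriting the query. */
final class PrefixExpansionSketch {

    /** Returns every indexed term that starts with the given prefix, in sorted order. */
    static List<String> expand(String prefix, SortedSet<String> indexedTerms) {
        List<String> matches = new ArrayList<>();
        // tailSet jumps to the first term >= prefix; in a naturally sorted set
        // all terms sharing the prefix are contiguous, so stop at the first miss.
        for (String term : indexedTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }
}
```

The expanded list is what a term-based highlighter can actually match against the text. The set must use natural String ordering for the early exit to be valid.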
Cheers
Mark
Hello,
I'm currently investigating improving the Highlighter currently
supplied in the lucene sandbox. Especially we'd like to parse
more
Thanks to the recent changes (see CVS) in TermFreqVector support we can now make use
of term offset information held
in the Lucene index rather than incurring the cost of re-analyzing text to highlight
it.
I have created a class ( see http://www.inperspective.com/lucene/TokenSources.java )
Hi Pasha,
I think the advice you gave is for an earlier version.
With the latest version things have moved around and you would have to:
//use a max fragment size equal to the size of the text to ensure you get all text in one fragment
highlighter.setTextFragmenter(new SimpleFragmenter(40));
//
I ask because in your example I should calculate the size of the initial text in any
case.
No calculation required - just use a really big number (e.g. Integer.MAX_VALUE). This
doesn't allocate any extra resources.
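To show why a huge limit is harmless, here is a rough sketch of a fixed-size fragmenter in plain Java. It is not the SimpleFragmenter source: FixedSizeFragmenterSketch is a hypothetical name, and it cuts on character counts where the real class works from token offsets. The point is only that the limit is a threshold to compare against, not a buffer size.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the highlighter's own fragmenter. */
final class FixedSizeFragmenterSketch {

    /** Splits text into consecutive fragments of at most maxSize characters. */
    static List<String> fragments(String text, int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            // Compare against the remaining length so a huge limit cannot overflow.
            int len = Math.min(maxSize, text.length() - start);
            out.add(text.substring(start, start + len));
            start += len;
        }
        return out;
    }
}
```

Calling fragments(text, Integer.MAX_VALUE) on non-empty text returns the whole text as a single fragment, with no need to measure the text first.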
Your highlightAll method sounds like it might be a useful addition. This how
The highlighter certainly doesn't support this requirement currently - but it is
designed to work with
a pluggable choice of Formatter class should you choose to implement this specialized
formatting code.
The highlighter is typically used to select the best sections from a piece of text,
A solution to this has been proposed before - see
http://wiki.apache.org/jakarta-lucene/CommunityContributions
Cheers
Mark
I have updated the Highlighter code in CVS to support tokenizers that generate
overlapping tokens.
The JUnit test rig has a new example test that uses a SynonymTokenizer which
generates multiple tokens
in the same position for the same input token eg (the token football is expanded
into
I wonder if the information in termPositions or termVector can be used
to restore token position from indices?
TermFreqVector gives you term frequencies (not positions). This can be of use in
computing document
similarities.
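As a rough illustration of the document-similarity use, the sketch below computes cosine similarity between two term-frequency maps in plain Java. Nothing here is Lucene API: TermFrequencySimilaritySketch is a hypothetical name, and it assumes you have already copied each document's terms and frequencies into a Map.

```java
import java.util.Map;

/** Hypothetical illustration only; works on plain maps, not on Lucene objects. */
final class TermFrequencySimilaritySketch {

    /** Cosine similarity of two term-frequency vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    /** Euclidean length of a term-frequency vector. */
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }
}
```

Documents with the same term proportions score close to 1, and documents with no terms in common score 0. Only frequencies are needed, which is why positions are not required for this kind of comparison.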
TermPositions gives you the sequence number, e.g. in the last
Yes, this issue has come up before with other choices of analyzers.
I think it should be fixable without changing any of the highlighter APIs
- can you email me or post here the source to your analyzer?
Cheers
Mark
A question before I dive into coding a fix: can I assume (for all analyzers) that the
tokens produced by the tokenStream
have the following property:
currentToken.startOffset() >= lastToken.startOffset()
The analyzers I have tested the highlighter with so far have the property:
The reason the current highlighter is not suitable for me, is that the
content of the document is not stored in the index
That shouldn't present a problem.
The working code example below was from a recent email discussion I had with someone
who was storing
text in a database. This simple example
Yes, highlighting multi-term queries does require a query.rewrite() call to expand
those terms before
calling the highlighter.
BUT, you could load the results documents into a temporary RAMDirectory and expand the
query by rewriting it
against THAT instead of the original index - it would still
Look up Mark Harwood and Lucene. ..provided some nice sequential
UML diagrams with notes
Those notes went missing recently when the ISP canned my free account.
I've resurrected them at my new site here:
http://www.inperspective.com/lucene/distrib/index.htm
Cheers
Mark
Hi Erik,
I've had this running OK from the command line and in Eclipse on XP.
I suspect it might be because you're running a different OS? The Classfinder tries
to split the system property
java.class.path on the ; character but I forgot different OSes have different
separators.
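A portable way to do the split is to use File.pathSeparator, which is ; on Windows and : on Unix-like systems. The sketch below is not the Classfinder code from the tool; ClasspathSplitSketch and its methods are hypothetical names, and it uses generics, so it needs a newer JDK than 1.4.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the tool's own class finder. */
final class ClasspathSplitSketch {

    /** Splits a classpath string on the given separator, skipping empty entries. */
    static List<String> entries(String classpath, String separator) {
        List<String> result = new ArrayList<>();
        // Quote the separator so it is treated literally, not as a regex.
        for (String entry : classpath.split(Pattern.quote(separator))) {
            if (!entry.isEmpty()) {
                result.add(entry);
            }
        }
        return result;
    }

    /** Splits the running JVM's classpath using the platform's own separator. */
    static List<String> currentEntries() {
        return entries(System.getProperty("java.class.path", ""), File.pathSeparator);
    }
}
```

Passing the separator in explicitly keeps the splitting logic testable on any OS, while currentEntries() picks the right one for the platform it runs on.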
As for Luke
I've knocked together this tool which automatically discovers Analyzers on the
classpath and provides a GUI to allow you to try out different Analyzers and see their
effects:
http://www.inperspective.com/lucene/Viewer.zip
This needs JDK1.4 and you'll need to define the classpath to include
If the Content is Stored as...
doc.add(Field.Text(contents, reader));
That's just it. It's not stored: see the javadocs for Field.Text(String, Reader):
Constructs a Reader-valued Field that is tokenized and indexed, but is not stored in
the index
As opposed to :
Field.Text(String name, String
I've put together some code to do this based on this API:
Document document(int docNum, String [] fieldNames);
You can now be selective about which fields you want to read off disk.
It does offer some speed-ups but it is not as fast as it could be due to a limitation
in the index
file
Hi Claude, that example code you provided is out of date.
For all concerned - the highlighter code was refactored about a month ago and then
moved into the Sandbox.
Want the latest version? - get the latest code from the sandbox CVS.
Want the latest docs? - Run javadoc on the above.
There is a
Was investigating, found some compile-time errors..
I see the code you have is taken from the example in the javadocs. Unfortunately that
example wasn't complete because the class didn't
include the method defined in the Formatter interface. I have updated the Javadocs to
correct this oversight.
Can I customize the way it highlights terms? Right now it does so by surrounding
them with <b>.
That's the job of a formatter class. You can pass one in the constructor eg:
Formatter myFormatter = new SimpleHTMLFormatter("<i>", "</i>");
Highlighter h = new Highlighter(myFormatter, new QueryScorer(query));
If
I've reworked the highlighter package to address some issues (inability to pass
fieldnames to analyzers,
limiting tokenization of large docs) and have refactored it to be more modular so that
folks
can provide alternative implementations of the main functions (tokenizing, fragmenting
and
Just got back from holiday to find that one of the ISPs I use shut down my site with
the highlighter code and a number of people have
complained about the broken highlighter link from the Lucene site.
I have temporarily put the missing highlighter code up here:
Hi Doug,
Thanks for the post. BoostingQuery looks to be cleaner, faster and more generally
useful than my
implementation :-)
Unless anyone has a particularly good reason I'll remove the link to my code that
Stephane put on the Wiki contributions page.
I definitely find BoostingQuery very
I've found an elegant way of doing this now for all types of search - a new
NegatingQuery class that takes any Query object in its constructor and
selects all documents that DON'T match and gives them a user-definable boost.
The code is here:
http://www.inperspective.com/lucene/demote.zip
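For readers who only want the idea rather than the download, the selection logic can be sketched in plain Java as follows. This is not the code in the zip and does not use any Lucene types: NegatingSketch, its method and its parameters are hypothetical, and documents are represented by bare integer ids.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical illustration only; not the NegatingQuery implementation. */
final class NegatingSketch {

    /**
     * Gives the supplied boost to every document id in [0, maxDoc) that is
     * absent from the set of ids matched by the wrapped query.
     */
    static Map<Integer, Float> scoreNonMatches(int maxDoc, Set<Integer> matches, float boost) {
        Map<Integer, Float> scores = new TreeMap<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            if (!matches.contains(doc)) {
                scores.put(doc, boost);
            }
        }
        return scores;
    }
}
```

Combined with the normal query, boosting everything that does not match an unwanted clause has the effect of demoting the documents that do match it.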
I have not been able to work out how to get custom coordination going to
demote results based on a specific term but have an alternative suggestion
that looks like it might work:
I've created a MissingTermQuery - which is the opposite of a TermQuery
and can be used to boost documents that DON'T
I have updated the MoreLikeThis query generator to address a few issues.
The code is available here: http://home.clara.net/markharwood/lucene/MoreLikeThis.java
I have added comments at the top of the class to describe the changes.
I was interested in the benefits of the new TermVector code so I
I tend to think of scaling in two dimensions: scaling by volumes of users and scaling
by volumes of data. The former is addressed through replicated indexes
and the latter by segmented indexes.
Distribute replicated segments across multiple boxes and create a broker which
a)Determines which
Hi Alex.
Looks to me like you have a classpath problem - you're running with a version other
than 1.3 final.
Earlier versions of Lucene didn't have the 2 methods in your error messages.
You'll need to check your classpath settings carefully.
I downloaded the highlighter package made available by
Here's the results of some tests using David's more like.. class.
http://home.clara.net/markharwood/lucene/mlt.htm
Looks useful. I have a couple of suggestions in the review.
Cheers
Mark
Hi Ken,
I've just had a look at the compatibility issues of my Highlighter package and Lucene
1.2.
It looks like the following Lucene methods are not present in this version:
BooleanQuery.getClauses();
PhraseQuery.getTerms()
TermQuery.getTerm() and
PriorityQueue.insert()
However, if you
Details of a new highlighter package are available here:
http://home.clara.net/markharwood/lucene/highlight.htm
Features include:
* Support for highlighting all query types
* Support for getting best fragments summary from large docs
* Works with latest version of Lucene
Hope you find this