>>But this brings up - has anyone run Lucene off a database trigger or
are triggers known to be slow and bad for this use?
I suspect the tricky bit would be knowing how to balance the calls to
Reader/Writer closes, opens and optimizes.
Record updates are the usual fun and games involving a r
Hi Pierre,
Here's the response I gave the last time this question was raised:
The highlighter uses a number of "pluggable" services, one of which is the
choice of "Fragmenter" implementation. This interface is for classes which
decide the boundaries where to cut the original text into snippets. Th
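As a rough illustration of what deciding fragment boundaries can mean, here is a minimal sketch in plain Java. It is not the highlighter's Fragmenter interface: the class name `FragmenterSketch`, the method `fragments`, and the rule of breaking at the first word that reaches the next multiple of the fragment size are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the highlighter package.
final class FragmenterSketch {
    // Cut text into snippets, starting a new snippet at the first word whose
    // end reaches the next multiple of fragmentSize.
    static List<String> fragments(String text, int fragmentSize) {
        List<String> result = new ArrayList<>();
        int fragmentStart = 0;
        long nextBoundary = fragmentSize; // long so huge sizes cannot overflow
        int pos = 0;
        while (pos < text.length()) {
            // Skip whitespace, then consume one word.
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
            int wordStart = pos;
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
            int wordEnd = pos;
            boolean isWord = wordEnd > wordStart;
            if (isWord && wordEnd >= nextBoundary && wordStart > fragmentStart) {
                addIfNotBlank(result, text.substring(fragmentStart, wordStart));
                fragmentStart = wordStart;
                nextBoundary += fragmentSize;
            }
        }
        if (fragmentStart < text.length()) {
            addIfNotBlank(result, text.substring(fragmentStart));
        }
        return result;
    }

    private static void addIfNotBlank(List<String> out, String s) {
        String trimmed = s.trim();
        if (!trimmed.isEmpty()) out.add(trimmed);
    }
}
```

With a fragment size larger than the text, the sketch returns the whole text as a single fragment.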
Nicko Cadell was good enough to point out the issues involved with
generating XHTML compliant markup with the highlighter and provided a
patch to fix it.
The main code has now been updated in the new SVN repository here:
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/
To
This from the highlighter package will give you the IDF :
WeightedTerm[] QueryTermExtractor.getIdfWeightedTerms(Query query,
IndexReader reader, String fieldName)
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional comman
Michael Celona wrote:
Does anyone have a working example of the highlighter class found in the
sandbox?
There are several in the accompanying Junit test:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/highlighter/src/test/org/apache/lucene/search/highlight/
Cheers
Mark
I posted a suggested solution to this some time ago:
http://marc.theaimsgroup.com/?l=lucene-user&m=108922279803667&w=2
The overhead of doing these tests was negligible but I haven't tried it
since TermVectors and the compound indexes were introduced.
Oscar Picasso wrote:
Hi,
On page 52 of Lucene
>>Writing this kind of an analyzer can be a bit of a hassle and the
position increment of 0 might affect highlighting code
The highlighter in the sandbox was refactored to support this kind of
analyzer some time ago so it shouldn't be a problem. The Junit tests
that come with the highlighter
Bruce Ritchie wrote:
The Highlighter package in CVS has been updated with the following new
features:
Good stuff. Will this work against the 1.4 or only against CVS head?
I think the TokenSources.java requires the latest CVS but is an optional
part of the highlighter package. All other co
I don't believe there is an automated build procedure in place for the
contents of the sandbox, consequently there are no Jars or Javadocs
created from the source - you need to do this manually using the
standard Java command line tools or with your IDE.
The Highlighter package in CVS has been updated with the following new
features:
* GradientFormatter is a new formatter that can be used to change the
colour intensity of matching terms, based on their score. I have found
this to be a useful way of visualizing the basis of query matches,
espec
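The idea of varying colour intensity with a term's score can be sketched in plain Java. This is an illustration only, not the GradientFormatter source: the class name `GradientSketch`, the methods `colourFor` and `highlight`, the white-to-red range and the font tag are assumptions made for the example.

```java
// Hypothetical sketch: map a term score onto a colour between white and red.
// None of these names come from the highlighter package.
final class GradientSketch {
    // Linearly interpolate one 0-255 colour channel.
    private static int channel(int from, int to, float ratio) {
        return Math.round(from + (to - from) * ratio);
    }

    // Returns "#RRGGBB" for a score, with score/maxScore clamped to [0, 1].
    static String colourFor(float score, float maxScore) {
        float ratio = maxScore <= 0f ? 0f : Math.max(0f, Math.min(1f, score / maxScore));
        int r = channel(0xFF, 0xFF, ratio);
        int g = channel(0xFF, 0x00, ratio);
        int b = channel(0xFF, 0x00, ratio);
        return String.format("#%02X%02X%02X", r, g, b);
    }

    // Wrap a term in a tag whose background reflects its score.
    static String highlight(String term, float score, float maxScore) {
        return "<font style=\"background: " + colourFor(score, maxScore) + "\">"
                + term + "</font>";
    }
}
```

Low-scoring terms come out close to white and high-scoring terms close to full red, which gives a quick visual cue of which terms contributed most to a match.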
The documentation for the highlighter already covers how to handle
wildcard queries.
See the javadoc notes on query.rewrite.
Cheers
Mark
--- Begin Message ---
Hello,
I'm currently investigating improving the Highlighter currently
supplied in the lucene sandbox. Especially we'd like to parse
mor
Having revisited the original TokenSources code it looks like one of the
optimisations I put in will fail if fields are stored with
non-contiguous position info (ie the analyzer has messed with token
position numbers so they overlap or have gaps like ..3,3,7,8,9,..).
I've now made the TokenSourc
Thanks to the recent changes (see CVS) in TermFreqVector support we can
now make use of term offset information held in the Lucene index rather
than incurring the cost of re-analyzing text to highlight it.
I have created a class ( see http://www.inperspective.com/lucene/TokenSources.java )
wh
>>I ask because in your example I should calculate the size of the
>>initial text in any case.
No calculation required - just use a really big number (e.g.
Integer.MAX_VALUE). This doesn't allocate any extra resources.
Your "highlightAll" method sounds like it might be a useful addition. This
Hi Pasha,
I think the advice you gave is for an earlier version.
With the latest version things have moved around and you would have to:
//use a max fragment size > size of text to ensure you get all text in one fragment
highlighter.setTextFragmenter(new SimpleFragmenter(40));
// c
The highlighter certainly doesn't support this requirement currently, but
it is designed to work with a pluggable choice of Formatter class should
you choose to implement this specialized formatting code.
The highlighter is typically used to select the "best" sections from a
piece of text, to
A solution to this has been proposed before - see
http://wiki.apache.org/jakarta-lucene/CommunityContributions
Cheers
Mark
I have updated the Highlighter code in CVS to support tokenizers that
generate overlapping tokens.
The Junit test rig has a new example test that uses a "SynonymTokenizer"
which generates multiple tokens in the same position for the same input
token eg (the token "football" is expanded into to
> I wonder if the information in termPositions or termVector can be used
> to restore token position from indices?
TermFreqVector gives you term frequencies (not positions). This can be of
use in computing document similarities.
TermPositions gives you the sequence number, e.g. in the last sente
> I need these values for highlighting. I've already looked to
> Highlighter in sandbox but it actually re-analyzes the original
> document's field.
Technically not true, as of a few months ago. The good news is the
highlighter has been redesigned specifically to use TokenStreams not
Analyzers.
Would it make more sense to use a parameter defining RAM size for the
cache rather than minMergeDocs?
Tuning RAM usage is the real issue here and controlling this by guessing
the number of docs you can squeeze into RAM is not the most helpful
approach. How about a "setMaxCacheSize(int megabytes
A colleague of mine found the fastest way to index was to use a
RAMDirectory, letting it grow to a pre-defined maximum size, then merging
it to a new temporary file-based index to flush it. Repeat this, creating
new directories for all the file-based indexes, then perform a merge into
one index o
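The buffer, flush and merge cycle described above can be modelled without Lucene. The sketch below is only a model of the pattern: plain lists stand in for the RAMDirectory and the temporary file-based indexes, size is counted in characters, and the names `BatchingSketch`, `add`, `flush` and `mergeAll` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the flush-then-merge pattern; lists stand in for the
// in-memory index and the temporary on-disk indexes.
final class BatchingSketch {
    private final long maxBufferedChars;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> flushedBatches = new ArrayList<>();
    private long bufferedChars = 0;

    BatchingSketch(long maxBufferedChars) {
        this.maxBufferedChars = maxBufferedChars;
    }

    // Buffer a document; flush once the buffer reaches the size limit.
    void add(String doc) {
        buffer.add(doc);
        bufferedChars += doc.length();
        if (bufferedChars >= maxBufferedChars) {
            flush();
        }
    }

    // Move the buffered documents into a new batch and reset the buffer.
    void flush() {
        if (buffer.isEmpty()) return;
        flushedBatches.add(new ArrayList<>(buffer));
        buffer.clear();
        bufferedChars = 0;
    }

    int batchCount() {
        return flushedBatches.size();
    }

    // Final step: flush what is left, then merge every batch into one result.
    List<String> mergeAll() {
        flush();
        List<String> merged = new ArrayList<>();
        for (List<String> batch : flushedBatches) {
            merged.addAll(batch);
        }
        return merged;
    }
}
```

The limit is expressed as a size rather than a document count, which is the same trade-off raised in the minMergeDocs discussion.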
>>Is there a way to get the whole document in result.
Use one big fragment...
highlighter.setTextFragmenter(new SimpleFragmenter(100));
Cheers
Mark
I think this version of the highlighter should provide a fix:
http://www.inperspective.com/lucene/hilite2beta.zip
Before I update the version of the highlighter in the sandbox I'd
appreciate feedback from those troubled with the issues to do with
overlapping tokens in token streams (Erik, Dave,
A question before I dive into coding a fix: can I assume (for all
analyzers) that the tokens produced by the tokenStream
have the following property:
currentToken.startOffset() >= lastToken.startOffset()
The analyzers I have tested the highlighter with so far have the property:
currentTok
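The ordering property being asked about is easy to test for a given token stream. Here is a minimal sketch that works on plain start offsets rather than Lucene Token objects; the class name `OffsetOrderSketch` and its method are hypothetical.

```java
// Hypothetical check of the property
//   currentToken.startOffset() >= lastToken.startOffset()
// applied to an array of start offsets collected from a token stream.
final class OffsetOrderSketch {
    // True when every start offset is >= the one before it.
    static boolean startOffsetsNonDecreasing(int[] startOffsets) {
        for (int i = 1; i < startOffsets.length; i++) {
            if (startOffsets[i] < startOffsets[i - 1]) {
                return false;
            }
        }
        return true;
    }
}
```

Offsets such as 0, 0, 9 (two tokens at the same place, as a synonym filter might emit) satisfy the property, whereas 0, 9, 4 does not.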
Yes, this issue has come up before with other choices of analyzers.
I think it should be fixable without changing any of the highlighter APIs
- can you email me or post here the source to your analyzer?
Cheers
Mark
Yes, highlighting multi-term queries does require a query.rewrite() call
to expand those terms before calling the highlighter.
BUT, you could load the results documents into a temporary RAMDirectory
and expand the query by rewriting it against THAT instead of the original
index - it would still
>>The reason the current highlighter is not suitable for me, is that the
>>content of the document is not stored in the index
That shouldn't present a problem.
The working code example below was from a recent email discussion I had
with someone who was storing text in a database. This simple exam
>> Look up Mark Harwood and Lucene. ..provided some nice sequential
>>UML diagrams with notes
Those notes went missing recently when the ISP canned my free account.
I've resurrected them at my new site here:
http://www.inperspective.com/lucene/distrib/index.htm
Cheers
Mark
Hi Erik,
I've had this running OK from the command line and in Eclipse on XP.
I suspect it might be because you're running a different OS? The
"Classfinder" tries to split the system property "java.class.path" on the
";" character but I forgot different OSes have different separators.
As for Lu
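For the classpath-splitting problem mentioned above, a portable approach is to split on `File.pathSeparator` instead of a hard-coded ";". The following is a minimal sketch of that idea, not the Classfinder code itself; the class name `ClasspathSplitSketch` and its methods are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of OS-independent classpath splitting.
final class ClasspathSplitSketch {
    // Split a classpath string on the given separator, dropping empty entries.
    static List<String> split(String classpath, String separator) {
        List<String> entries = new ArrayList<>();
        // Pattern.quote keeps the separator from being read as a regex.
        for (String entry : classpath.split(Pattern.quote(separator))) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    // Entries of the running JVM's classpath, using the platform separator
    // (";" on Windows, ":" on Unix-like systems).
    static List<String> currentClasspath() {
        return split(System.getProperty("java.class.path", ""), File.pathSeparator);
    }
}
```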
I've knocked together this tool which automatically discovers Analyzers on the
classpath and provides a GUI to allow you to try out different Analyzers and see their
effects:
http://www.inperspective.com/lucene/Viewer.zip
This needs JDK1.4 and you'll need to define the classpath to include Luce
>>If the Content is Stored as...
>>doc.add(Field.Text("contents", reader));
That's just it. It's not stored: see the javadocs for Field.Text(String, Reader):
"Constructs a Reader-valued Field that is tokenized and indexed, but is not stored in
the index"
As opposed to :
Field.Text(String name,
I've put together some code to do this based on this API:
Document document(int docNum, String [] fieldNames);
You can now be selective about which fields you want to read off disk.
It does offer some speed-ups but it is not as fast as it could be due to
a limitation in the index file forma
Hi Claude, that example code you provided is out of date.
For all concerned - the highlighter code was refactored about a month ago and then
moved into the Sandbox.
Want the latest version? - get the latest code from the sandbox CVS.
Want the latest docs? - Run javadoc on the above.
There is a
>>Was Investigating,found some Compile time error..
I see the code you have is taken from the example in the javadocs.
Unfortunately that example wasn't complete because the class didn't
include the method defined in the Formatter interface. I have updated the
Javadocs to correct this oversight.
>>Can I customize the way it does highlight terms? Right now it does so by surrounding
>>with .
That's the job of a formatter class. You can pass one in the constructor eg:
Formatter myFormatter=new SimpleHTMLFormatter("","");
Highlighter h=new Highlighter(myFormatter, new QueryScorer(query));
If
I've reworked the highlighter package to address some issues (inability
to pass fieldnames to analyzers, limiting tokenization of large docs) and
have refactored it to be more modular so that folks can provide
alternative implementations of the main functions (tokenizing,
fragmenting and scoring
Just got back from holiday to find that one of the ISPs I use shut down
my site with the highlighter code and a number of people have complained
about the broken "highlighter" link from the Lucene site.
I have temporarily put the missing highlighter code up here:
http://www.inperspective.com/luc
730 msecs is the correct number for 10 * 16k docs with StandardTokenizer!
The 11ms per doc figure in my post was for highlighting using a
lower-case-filter-only analyzer. 5ms of this figure was the cost of the
lower-case-filter-only analyzer.
73 msecs is the cost of JUST StandardTokenizer
>>Folks have benchmarked this, and, for documents less than 10k characters or so,
>>re-tokenizing is fast enough.
As a note of warning: I did find StandardTokenizer to be the major culprit in my
tokenizing benchmarks (avg 75ms for 16k sized docs).
I have found I can live without StandardTokenize
>>You could, if you fail to find any fragments that match the entire
>>query, re-query the fragments with a flattened query containing just an
>>OR of all of the original query terms.
The other issue with this approach I'm still struggling with is simply
the cost of creating the temporary index
Hi Doug,
Thanks for the post. BoostingQuery looks to be cleaner, faster and more
generally useful than my implementation :-)
Unless anyone has a particularly good reason I'll remove the link to my code that
Stephane put on the Wiki contributions page.
I definitely find BoostingQuery very useful
I've found an elegant way of doing this now for all types of search - a new
"NegatingQuery" class that takes any Query object in its constructor and
selects all documents that DON'T match and gives them a user-definable boost.
The code is here:
http://www.inperspective.com/lucene/demote.zip
Chee
I have not been able to work out how to get custom coordination going to
demote results based on a specific term but have an alternative suggestion
that looks like it might work:
I've created a "MissingTermQuery" - which is the opposite of a TermQuery
and can be used to boost documents that DON'T
I have updated the MoreLikeThis query generator to address a few issues.
The code is available here: http://home.clara.net/markharwood/lucene/MoreLikeThis.java
I have added comments at the top of the class to describe the changes.
I was interested in the benefits of the new TermVector code so I be
I tend to think of scaling in two dimensions: scaling by volumes of users and scaling
by volumes of data. The former is addressed through replicated indexes
and the latter by segmented indexes.
Distribute replicated segments across multiple boxes and create a broker which
a)Determines which segm
Hi Alex.
Looks to me like you have a classpath problem - you're running with a version other
than 1.3 final.
Earlier versions of Lucene didn't have the 2 methods in your error messages.
You'll need to check your classpath settings carefully.
>>I downloaded the highlighter package made available b
Here are the results of some tests using David's "more like.." class.
http://home.clara.net/markharwood/lucene/mlt.htm
Looks useful. I have a couple of suggestions in the review.
Cheers
Mark
Hi Ken,
I've just had a look at the compatibility issues of my Highlighter
package and Lucene 1.2.
It looks like the following Lucene methods are not present in this version:
BooleanQuery.getClauses();
PhraseQuery.getTerms()
TermQuery.getTerm() and
PriorityQueue.insert()
However, if you
Details of a new highlighter package are available here:
http://home.clara.net/markharwood/lucene/highlight.htm
Features include:
* Support for highlighting all query types
* Support for getting "best fragments" summary from large docs
* Works with latest version of Lucene
Hope you find this use