Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread markharw00d
But this brings up a question: has anyone run Lucene off a database trigger, or are triggers known to be slow and bad for this use? I suspect the tricky bit would be knowing how to balance the calls to Reader/Writer closes, opens and optimizes. Record updates are the usual fun and games involving a

Re: Lucene cuts the search results ?

2005-02-15 Thread markharw00d
Hi Pierre, Here's the response I gave the last time this question was raised: The highlighter uses a number of pluggable services, one of which is the choice of Fragmenter implementation. This interface is for classes which decide the boundaries at which to cut the original text into snippets. The

Highlighter: new support for encoding

2005-02-06 Thread markharw00d
Nicko Cadell was good enough to point out the issues involved with generating XHTML compliant markup with the highlighter and provided a patch to fix it. The main code has now been updated in the new SVN repository here: http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/

Re: query term frequency

2005-01-28 Thread markharw00d
This from the highlighter package will give you the IDF: WeightedTerm[] QueryTermExtractor.getIdfWeightedTerms(Query query, IndexReader reader, String fieldName)
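For readers who have not met the term, here is a minimal, library-free sketch of what an inverse document frequency (IDF) weight looks like. It assumes the formula log(numDocs / (docFreq + 1)) + 1, which is my understanding of Lucene's default similarity of that era; the class and method names are hypothetical and are not part of the highlighter package, and getIdfWeightedTerms may combine the value with other factors such as query boosts.

```java
/** Hypothetical illustration only; not part of Lucene or the highlighter. */
final class IdfSketch {

    /**
     * Assumed classic formula: log(numDocs / (docFreq + 1)) + 1.
     * Rare terms (small docFreq) get a larger weight than common ones.
     */
    public static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // Print the weight for a rare, a moderately common, and a very common term.
        for (int docFreq : new int[] {1, 100, 999}) {
            System.out.println("docFreq=" + docFreq + " idf=" + idf(docFreq, numDocs));
        }
    }
}
```

The point is only that a term appearing in few documents scores higher than one appearing in most of them, which is why IDF-weighted terms are useful for picking the most informative words in a query.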

Re: text highlighting

2005-01-26 Thread markharw00d
Michael Celona wrote: Does anyone have a working example of the highlighter class found in the sandbox? There are several in the accompanying JUnit test: http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/highlighter/src/test/org/apache/lucene/search/highlight/ Cheers Mark

Re: Lucene in Action: Batch indexing by using RAMDirectory

2005-01-22 Thread markharw00d
I posted a suggested solution to this some time ago: http://marc.theaimsgroup.com/?l=lucene-user&m=108922279803667&w=2 The overhead of doing these tests was negligible but I haven't tried it since TermVectors and the compound indexes were introduced. Oscar Picasso wrote: Hi, On page 52 of Lucene

Re: Question about Analyzer and words spelled in different languages

2005-01-17 Thread markharw00d
Writing this kind of an analyzer can be a bit of a hassle and the position increment of 0 might affect highlighting code. The highlighter in the sandbox was refactored to support these kinds of analyzers some time ago so it shouldn't be a problem. The JUnit tests that come with the highlighter

RE: New Highlighter features + api

2005-01-03 Thread markharw00d
I don't believe there is an automated build procedure in place for the contents of the sandbox. Consequently there are no Jars or Javadocs created from the source - you need to do this manually using the standard Java command line tools or with your IDE.

Re: New Highlighter features

2005-01-03 Thread markharw00d
Bruce Ritchie wrote: The Highlighter package in CVS has been updated with the following new features: Good stuff. Will this work against the 1.4 or only against CVS head? I think the TokenSources.java requires the latest CVS but is an optional part of the highlighter package. All other

[Fwd: API suggestion]

2004-12-06 Thread markharw00d
The documentation for the highlighter already covers how to handle wildcard queries. See the javadoc notes on query.rewrite. Cheers Mark ---BeginMessage--- Hello, I'm currently investigating improving the Highlighter currently supplied in the lucene sandbox. Especially we'd like to parse more

Faster highlighting with TermPositionVectors

2004-10-28 Thread markharw00d
Thanks to the recent changes (see CVS) in TermFreqVector support we can now make use of term offset information held in the Lucene index rather than incurring the cost of re-analyzing text to highlight it. I have created a class (see http://www.inperspective.com/lucene/TokenSources.java)

RE: highlight the search word

2004-08-15 Thread markharw00d
Hi Pasha, I think the advice you gave is for an earlier version. With the latest version things have moved around and you would have to: //use a max fragment size of at least the size of the text to ensure you get all text in one fragment highlighter.setTextFragmenter(new SimpleFragmenter(40)); //

RE: highlight the search word

2004-08-15 Thread markharw00d
I ask because in your example I should calculate the size of the initial text in any case. No calculation required - just use a really big number (e.g. Integer.MAX_VALUE). This doesn't allocate any extra resources. Your highlightAll method sounds like it might be a useful addition. This how
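To illustrate why a very large fragment size costs nothing extra, here is a minimal sketch of a fixed-size fragmenter in plain Java. It is an assumption-laden stand-in, not the highlighter's SimpleFragmenter: the class and method names are hypothetical, and it cuts on character counts rather than on token boundaries.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical character-based fragmenter; not the highlighter's own class. */
final class FixedSizeFragments {

    /** Cuts text into consecutive pieces of at most fragmentSize characters. */
    public static List<String> fragment(String text, int fragmentSize) {
        if (fragmentSize <= 0) {
            throw new IllegalArgumentException("fragmentSize must be positive");
        }
        List<String> fragments = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            // long arithmetic avoids int overflow when fragmentSize is Integer.MAX_VALUE
            int end = (int) Math.min((long) start + fragmentSize, text.length());
            fragments.add(text.substring(start, end));
            start = end;
        }
        return fragments;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog";
        System.out.println(fragment(text, 10));
        // A huge size is only an upper bound, so the whole text lands in one fragment.
        System.out.println(fragment(text, Integer.MAX_VALUE).size());
    }
}
```

The size is just a limit that is compared against positions in the text; nothing is allocated in proportion to it, so passing Integer.MAX_VALUE simply means "never cut".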

Re: Highlighter and HTML tags

2004-08-10 Thread markharw00d
The highlighter certainly doesn't support this requirement currently - but it is designed to work with a pluggable choice of Formatter class should you choose to implement this specialized formatting code. The highlighter is typically used to select the best sections from a piece of text,

Re: Negative Boost

2004-08-04 Thread markharw00d
A solution to this has been proposed before - see http://wiki.apache.org/jakarta-lucene/CommunityContributions Cheers Mark

Highlighter package updated with overlapping token support

2004-07-26 Thread markharw00d
I have updated the Highlighter code in CVS to support tokenizers that generate overlapping tokens. The Junit test rig has a new example test that uses a SynonymTokenizer which generates multiple tokens in the same position for the same input token eg (the token football is expanded into

Re: Can I retrieve token offsets from Hits?

2004-07-22 Thread markharw00d
I wonder if the information in termPositions or termVector can be used to restore token position from indices? TermFreqVector gives you term frequencies (not positions). This can be of use in computing document similarities. TermPositions gives you the sequence number, e.g. in the last

Re: amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread markharw00d
Yes, this issue has come up before with other choices of analyzers. I think it should be fixable without changing any of the highlighter APIs - can you email me or post here the source to your analyzer? Cheers Mark

Re: amusing interaction between advanced tokenizers and highlighter

2004-06-19 Thread markharw00d
A question before I dive into coding a fix: can I assume (for all analyzers) that the tokens produced by the tokenStream have the following property: currentToken.startOffset() >= lastToken.startOffset() The analyzers I have tested the highlighter with so far have the property:
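The property being asked about can be checked mechanically. The sketch below reads it as "each token starts at or after the start of the previous token", which is my interpretation of the message rather than something the archive states outright. Tok is a hypothetical stand-in holding only two offsets; it is not Lucene's Token class.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical check over token offsets; Tok is not Lucene's Token class. */
final class OffsetOrderCheck {

    /** Minimal stand-in for an analyzer token: character offsets only. */
    static final class Tok {
        final int startOffset;
        final int endOffset;

        Tok(int startOffset, int endOffset) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** True if every token starts at or after the start of the token before it. */
    public static boolean startsAreNonDecreasing(List<Tok> tokens) {
        int lastStart = Integer.MIN_VALUE;
        for (Tok t : tokens) {
            if (t.startOffset < lastStart) {
                return false;
            }
            lastStart = t.startOffset;
        }
        return true;
    }

    public static void main(String[] args) {
        // Two tokens sharing the same offsets, as a synonym-injecting analyzer might emit.
        List<Tok> withSynonym = Arrays.asList(new Tok(0, 8), new Tok(0, 8), new Tok(9, 14));
        System.out.println(startsAreNonDecreasing(withSynonym));
    }
}
```

Running a check like this over the output of a custom analyzer is a cheap way to find out whether it breaks the ordering a highlighter might rely on.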

Re: How to extract matching terms for a document given a query

2004-06-16 Thread markharw00d
The reason the current highlighter is not suitable for me, is that the content of the document is not stored in the index. That shouldn't present a problem. The working code example below was from a recent email discussion I had with someone who was storing text in a database. This simple example

Re: How to extract matching terms for a document given a query

2004-06-16 Thread markharw00d
Yes, highlighting multi-term queries does require a query.rewrite() call to expand those terms before calling the highlighter. BUT, you could load the results documents into a temporary RAMDirectory and expand the query by rewriting it against THAT instead of the original index - it would still

Re: Distributed searches and RAM Dir

2004-06-04 Thread markharw00d
Look up Mark Harwood and Lucene. ..provided some nice sequential UML diagrams with notes. Those notes went missing recently when the ISP canned my free account. I've resurrected them at my new site here: http://www.inperspective.com/lucene/distrib/index.htm Cheers Mark

Re: Tool for analyzing analyzers

2004-05-28 Thread markharw00d
Hi Erik, I've had this running OK from the command line and in Eclipse on XP. I suspect it might be because you're running a different OS? The Classfinder tries to split the system property java.class.path on the ; character but I forgot different OSes have different separators. As for Luke
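A portable way to split the classpath is to use the platform's own separator, which the JDK exposes as File.pathSeparator (";" on Windows, ":" on Unix-like systems). The sketch below is an illustration of that idea only; the class and method names are hypothetical and this is not the Classfinder code from the tool.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper; not the Classfinder class from the analyzer tool. */
final class ClasspathEntries {

    /** Splits a classpath string on the given separator, skipping empty entries. */
    public static List<String> split(String classpath, String separator) {
        List<String> entries = new ArrayList<>();
        if (classpath == null || classpath.isEmpty()) {
            return entries;
        }
        // Pattern.quote treats the separator literally rather than as a regex.
        for (String entry : classpath.split(Pattern.quote(separator))) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    /** Splits the running JVM's classpath using the platform separator. */
    public static List<String> fromSystemProperty() {
        return split(System.getProperty("java.class.path"), File.pathSeparator);
    }

    public static void main(String[] args) {
        for (String entry : fromSystemProperty()) {
            System.out.println(entry);
        }
    }
}
```

Keeping the separator as a parameter of split makes the logic easy to test with both ";" and ":" regardless of the OS the test runs on.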

Tool for analyzing analyzers

2004-05-27 Thread markharw00d
I've knocked together this tool which automatically discovers Analyzers on the classpath and provides a GUI to allow you to try out different Analyzers and see their effects: http://www.inperspective.com/lucene/Viewer.zip This needs JDK1.4 and you'll need to define the classpath to include

RE: org.apache.lucene.search.highlight.Highlighter

2004-05-25 Thread markharw00d
If the Content is Stored as... doc.add(Field.Text("contents", reader)); That's just it. It's not stored: see the javadocs for Field.Text(String, Reader): "Constructs a Reader-valued Field that is tokenized and indexed, but is not stored in the index". As opposed to: Field.Text(String name, String

Re: Possible to fetch a document without all fields for performance?

2004-05-22 Thread markharw00d
I've put together some code to do this based on this API: Document document(int docNum, String [] fieldNames); You can now be selective about which fields you want to read off disk. It does offer some speed-ups but it is not as fast as it could be due to a limitation in the index file

Re: org.apache.lucene.search.highlight.Highlighter

2004-05-21 Thread markharw00d
Hi Claude, that example code you provided is out of date. For all concerned - the highlighter code was refactored about a month ago and then moved into the Sandbox. Want the latest version? - get the latest code from the sandbox CVS. Want the latest docs? - Run javadoc on the above. There is a

Re: org.apache.lucene.search.highlight.Highlighter

2004-05-19 Thread markharw00d
Was Investigating, found some Compile time error.. I see the code you have is taken from the example in the javadocs. Unfortunately that example wasn't complete because the class didn't include the method defined in the Formatter interface. I have updated the Javadocs to correct this oversight.

Re: Highlighter package v2 RC1

2004-04-10 Thread markharw00d
Can I customize the way it does highlight terms? Right now it does so by surrounding with <b>. That's the job of a formatter class. You can pass one in the constructor, e.g.: Formatter myFormatter=new SimpleHTMLFormatter("<i>","</i>"); Highlighter h=new Highlighter(myFormatter, new QueryScorer(query)); If

Highlighter package v2 RC1

2004-04-08 Thread markharw00d
I've reworked the highlighter package to address some issues (inability to pass fieldnames to analyzers, limiting tokenization of large docs) and have refactored it to be more modular so that folks can provide alternative implementations of the main functions (tokenizing, fragmenting and

re: Highlight package

2004-04-06 Thread markharw00d
Just got back from holiday to find that one of the ISPs I use shut down my site with the highlighter code and a number of people have complained about the broken highlighter link from the Lucene site. I have temporarily put the missing highlighter code up here:

Re: Demoting results

2004-03-29 Thread markharw00d
Hi Doug, Thanks for the post. BoostingQuery looks to be cleaner, faster and more generally useful than my implementation :-) Unless anyone has a particularly good reason I'll remove the link to my code that Stephane put on the Wiki contributions page. I definitely find BoostingQuery very

Re: Demoting results

2004-03-28 Thread markharw00d
I've found an elegant way of doing this now for all types of search - a new NegatingQuery class that takes any Query object in its constructor and selects all documents that DON'T match and gives them a user-definable boost. The code is here: http://www.inperspective.com/lucene/demote.zip

Re: Demoting results

2004-03-26 Thread markharw00d
I have not been able to work out how to get custom coordination going to demote results based on a specific term but have an alternative suggestion that looks like it might work: I've created a MissingTermQuery - which is the opposite of a TermQuery and can be used to boost documents that DON'T

More Like This Query updated plus benchmarks

2004-02-29 Thread markharw00d
I have updated the MoreLikeThis query generator to address a few issues. The code is available here: http://home.clara.net/markharwood/lucene/MoreLikeThis.java I have added comments at the top of the class to describe the changes. I was interested in the benefits of the new TermVector code so I

RE: Lucene scalability/clustering

2004-02-26 Thread markharw00d
I tend to think of scaling in two dimensions: scaling by volumes of users and scaling by volumes of data. The former is addressed through replicated indexes and the latter by segmented indexes. Distribute replicated segments across multiple boxes and create a broker which a)Determines which

Problem using highlighter package

2004-02-18 Thread markharw00d
Hi Alex. Looks to me like you have a classpath problem - you're running with a version other than 1.3 final. Earlier versions of Lucene didn't have the 2 methods in your error messages. You'll need to check your classpath settings carefully. I downloaded the highlighter package made available by

Re: MoreLikeThis Query generator - Re: code for more like this

2004-02-17 Thread markharw00d
Here are the results of some tests using David's more like.. class. http://home.clara.net/markharwood/lucene/mlt.htm Looks useful. I have a couple of suggestions in the review. Cheers Mark

RE: Lucene 1.2 Hit Highlighting

2003-12-07 Thread markharw00d
Hi Ken, I've just had a look at the compatibility issues of my Highlighter package and Lucene 1.2. It looks like the following Lucene methods are not present in this version: BooleanQuery.getClauses(); PhraseQuery.getTerms() TermQuery.getTerm() and PriorityQueue.insert() However, if you

New highlighter package available

2003-09-24 Thread markharw00d
Details of a new highlighter package are available here: http://home.clara.net/markharwood/lucene/highlight.htm Features include: * Support for highlighting all query types * Support for getting best fragments summary from large docs * Works with latest version of Lucene Hope you find this