Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread markharw00d
>>But this brings up - has anyone run Lucene off a database trigger or are triggers known to be slow and bad for this use? I suspect the tricky bit would be knowing when to balance the calls to Reader/Writer closes, opens and optimizes. Record updates are the usual fun and games involving a r

Re: Lucene "cuts" the search results ?

2005-02-15 Thread markharw00d
Hi Pierre, Here's the response I gave the last time this question was raised: The highlighter uses a number of "pluggable" services, one of which is the choice of "Fragmenter" implementation. This interface is for classes which decide the boundaries at which to cut the original text into snippets. Th
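A minimal sketch of plugging in a Fragmenter, assuming the contrib highlighter API of this period (the 100-character fragment size and the tokenStream/text variables are just placeholders):

    // SimpleFragmenter is the stock Fragmenter implementation: it cuts the
    // original text into fixed-size character fragments.
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    highlighter.setTextFragmenter(new SimpleFragmenter(100)); // ~100-char snippets
    String[] fragments = highlighter.getBestFragments(tokenStream, text, 3);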

Highlighter: new support for encoding

2005-02-06 Thread markharw00d
Nicko Cadell was good enough to point out the issues involved with generating XHTML-compliant markup with the highlighter and provided a patch to fix it. The main code has now been updated in the new SVN repository here: http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/ To
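A sketch of how the encoding support can be wired in, assuming the post-patch constructor form of the contrib Highlighter:

    // SimpleHTMLEncoder escapes the non-highlighted text so the output stays
    // well-formed XHTML; SimpleHTMLFormatter wraps the matching terms.
    Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(),
                                              new SimpleHTMLEncoder(),
                                              new QueryScorer(query));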

Re: query term frequency

2005-01-28 Thread markharw00d
This from the highlighter package will give you the IDF: WeightedTerm[] QueryTermExtractor.getIdfWeightedTerms(Query query, IndexReader reader, String fieldName)
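For example (the "contents" field name is just a placeholder):

    WeightedTerm[] terms =
        QueryTermExtractor.getIdfWeightedTerms(query, reader, "contents");
    for (int i = 0; i < terms.length; i++) {
        System.out.println(terms[i].getTerm() + " weight=" + terms[i].getWeight());
    }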

Re: text highlighting

2005-01-26 Thread markharw00d
Michael Celona wrote: Does anyone have a working example of the highlighter class found in the sandbox? There are several in the accompanying JUnit tests: http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/highlighter/src/test/org/apache/lucene/search/highlight/ Cheers Mark

Re: Lucene in Action: Batch indexing by using RAMDirectory

2005-01-22 Thread markharw00d
I posted a suggested solution to this some time ago: http://marc.theaimsgroup.com/?l=lucene-user&m=108922279803667&w=2 The overhead of doing these tests was negligible but I haven't tried it since TermVectors and the compound indexes were introduced. Oscar Picasso wrote: Hi, On page 52 of Lucene

Re: Question about Analyzer and words spelled in different languages

2005-01-17 Thread markharw00d
>>Writing this kind of an analyzer can be a bit of a hassle and the position increment of 0 might affect highlighting code The highlighter in the sandbox was refactored to support these kinds of analyzers some time ago so it shouldn't be a problem. The JUnit tests that come with the highlighter

Re: New Highlighter features

2005-01-03 Thread markharw00d
Bruce Ritchie wrote: The Highlighter package in CVS has been updated with the following new features: Good stuff. Will this work against the 1.4 release or only against CVS head? I think TokenSources.java requires the latest CVS but is an optional part of the highlighter package. All other co

RE: New Highlighter features + api

2005-01-03 Thread markharw00d
I don't believe there is an automated build procedure in place for the contents of the sandbox; consequently there are no Jars or Javadocs created from the source - you need to do this manually using the standard Java command line tools or with your IDE.

New Highlighter features

2005-01-02 Thread markharw00d
The Highlighter package in CVS has been updated with the following new features: * GradientFormatter is a new formatter that can be used to change the colour intensity of matching terms, based on their score. I have found this to be a useful way of visualizing the basis of query matches, espec

[Fwd: API suggestion]

2004-12-06 Thread markharw00d
The documentation for the highlighter already covers how to handle wildcard queries. See the javadoc notes on query.rewrite. Cheers Mark --- Begin Message --- Hello, I'm currently investigating improving the Highlighter currently supplied in the Lucene sandbox. In particular, we'd like to parse mor
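In outline, the approach those javadoc notes describe is to rewrite the query before highlighting so that wildcard, prefix and fuzzy clauses are expanded into the plain term queries the highlighter can score; a sketch:

    // rewrite() expands WildcardQuery, PrefixQuery, FuzzyQuery etc. into the
    // concrete terms found in the index, which the highlighter can then match.
    Query expanded = query.rewrite(indexReader);
    Highlighter highlighter = new Highlighter(new QueryScorer(expanded));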

Re: Faster highlighting with TermPositionVectors (update)

2004-11-04 Thread markharw00d
Having revisited the original TokenSources code, it looks like one of the optimisations I put in will fail if fields are stored with non-contiguous position info (i.e. the analyzer has messed with token position numbers so they overlap or have gaps like ..3,3,7,8,9,..). I've now made the TokenSourc

Faster highlighting with TermPositionVectors

2004-10-28 Thread markharw00d
Thanks to the recent changes (see CVS) in TermFreqVector support we can now make use of term offset information held in the Lucene index rather than incurring the cost of re-analyzing text to highlight it. I have created a class (see http://www.inperspective.com/lucene/TokenSources.java) wh
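A sketch of the intended usage, assuming the method names of the TokenSources class linked above (they match the version later added to the contrib highlighter) and a field indexed with positions and offsets:

    TermPositionVector tpv =
        (TermPositionVector) reader.getTermFreqVector(docId, "contents");
    TokenStream tokenStream = TokenSources.getTokenStream(tpv);
    // no re-analysis of the stored text is needed
    String fragment = highlighter.getBestFragment(tokenStream, text);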

RE: highlight the search word

2004-08-15 Thread markharw00d
>>I ask because in your example I should calculate the size of the initial text in any >>case. No calculation required - just use a really big number (e.g. Integer.MAX_VALUE). This doesn't allocate any extra resources. Your "highlightAll" method sounds like it might be a useful addition. This

RE: highlight the search word

2004-08-15 Thread markharw00d
Hi Pasha, I think the advice you gave is for an earlier version. With the latest version things have moved around and you would have to: //use a max fragment size > size of text to ensure you get all text in one fragment highlighter.setTextFragmenter(new SimpleFragmenter(40)); // c
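Completing the pattern sketched above with the "really big number" advice from the previous message (the analyzer and "contents" field name are placeholders):

    // a fragment size larger than any document means the whole text is
    // returned as a single highlighted fragment
    highlighter.setTextFragmenter(new SimpleFragmenter(Integer.MAX_VALUE));
    String wholeDoc = highlighter.getBestFragment(analyzer, "contents", text);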

Re: Highlighter and HTML tags

2004-08-10 Thread markharw00d
The highlighter certainly doesn't support this requirement currently - but it is designed to work with a pluggable choice of Formatter class should you choose to implement this specialized formatting code. The highlighter is typically used to select the "best" sections from a piece of text, to

Re: Negative Boost

2004-08-04 Thread markharw00d
A solution to this has been proposed before - see http://wiki.apache.org/jakarta-lucene/CommunityContributions Cheers Mark

Highlighter package updated with overlapping token support

2004-07-26 Thread markharw00d
I have updated the Highlighter code in CVS to support tokenizers that generate overlapping tokens. The JUnit test rig has a new example test that uses a "SynonymTokenizer" which generates multiple tokens in the same position for the same input token, e.g. (the token "football" is expanded into to
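A minimal sketch of the kind of filter such a test exercises, assuming the Lucene 1.4-era Token/TokenFilter API (the football/soccer pairing is just an illustrative synonym):

    // Emits "soccer" at the same position as "football" by giving the extra
    // token a position increment of zero.
    public class SimpleSynonymFilter extends TokenFilter {
        private Token pending;

        public SimpleSynonymFilter(TokenStream in) {
            super(in);
        }

        public Token next() throws IOException {
            if (pending != null) {
                Token t = pending;
                pending = null;
                return t;
            }
            Token t = input.next();
            if (t != null && t.termText().equals("football")) {
                pending = new Token("soccer", t.startOffset(), t.endOffset());
                pending.setPositionIncrement(0); // overlap with "football"
            }
            return t;
        }
    }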

Re: Can I retrieve token offsets from Hits?

2004-07-22 Thread markharw00d
> I wonder if the information in termPositions or termVector can be used > to restore token position from indices? TermFreqVector gives you term frequencies (not positions). This can be of use in computing document similarities. TermPositions gives you the sequence number, e.g. in the last sente
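By way of illustration (the field and term names are placeholders):

    // term frequencies for one document (field must be indexed with term vectors)
    TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
    String[] terms = tfv.getTerms();
    int[] freqs = tfv.getTermFrequencies();

    // sequence numbers ("this is the Nth token in the field") for one term
    TermPositions tp = reader.termPositions(new Term("contents", "lucene"));
    while (tp.next()) {
        for (int i = 0; i < tp.freq(); i++) {
            int position = tp.nextPosition(); // 0-based token position
        }
    }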

Re: Can I retrieve token offsets from Hits?

2004-07-21 Thread markharw00d
> I need these values for highlighting. I've already looked at the > Highlighter in the sandbox but it actually re-analyzes the original > document's field. Technically not true, as of a few months ago. The good news is the highlighter has been redesigned specifically to use TokenStreams rather than Analyzers.

Re: Most efficient way to index 14M documents (out of memory/file

2004-07-07 Thread markharw00d
Would it make more sense to use a parameter defining RAM size for the cache rather than minMergeDocs? Tuning RAM usage is the real issue here and controlling this by guessing the number of docs you can squeeze into RAM is not the most helpful approach. How about a "setMaxCacheSize(int megabytes

Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread markharw00d
A colleague of mine found the fastest way to index was to use a RAMDirectory, letting it grow to a pre-defined maximum size, then merging it to a new temporary file-based index to flush it. Repeat this, creating new directories for all the file-based indexes, then perform a merge into one index o
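A sketch of that strategy with the Lucene 1.4-era API (the paths and the size check are placeholders):

    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
    // ... addDocument() calls until the RAMDirectory hits its size budget ...
    ramWriter.close();

    // flush the RAM index into a new temporary file-based index
    IndexWriter partWriter =
        new IndexWriter(FSDirectory.getDirectory("/tmp/part0", true), analyzer, true);
    partWriter.addIndexes(new Directory[] { ramDir });
    partWriter.close();

    // repeat the above, then merge all the parts into the final index
    IndexWriter finalWriter =
        new IndexWriter(FSDirectory.getDirectory("/indexes/final", true), analyzer, true);
    finalWriter.addIndexes(new Directory[] {
        FSDirectory.getDirectory("/tmp/part0", false) /* , part1, part2 ... */
    });
    finalWriter.close();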

Re: Using Highlighter in web Demo

2004-06-29 Thread markharw00d
>>Is there a way to get the whole document in the result? Use one big fragment... highlighter.setTextFragmenter(new SimpleFragmenter(100)); Cheers Mark

Fix for "advanced tokenizers and highlighter" problem

2004-06-21 Thread markharw00d
I think this version of the highlighter should provide a fix: http://www.inperspective.com/lucene/hilite2beta.zip Before I update the version of the highlighter in the sandbox I'd appreciate feedback from those troubled by the issues with overlapping tokens in token streams (Erik, Dave,

Re: amusing interaction between advanced tokenizers and highlighter

2004-06-19 Thread markharw00d
A question before I dive into coding a fix: can I assume (for all analyzers) that the tokens produced by the tokenStream have the following property: currentToken.startOffset() >= lastToken.startOffset() The analyzers I have tested the highlighter with so far have the property: currentTok

Re:amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread markharw00d
Yes, this issue has come up before with other choices of analyzers. I think it should be fixable without changing any of the highlighter APIs - can you email me or post here the source to your analyzer? Cheers Mark

Re: How to extract matching terms for a document given a query

2004-06-16 Thread markharw00d
Yes, highlighting multi-term queries does require a query.rewrite() call to expand those terms before calling the highlighter. BUT, you could load the result documents into a temporary RAMDirectory and expand the query by rewriting it against THAT instead of the original index - it would still
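A sketch of that idea (it assumes the hit documents carry enough stored, tokenized fields to be re-indexed):

    // build a small throwaway index from just the result documents...
    RAMDirectory tempDir = new RAMDirectory();
    IndexWriter tempWriter = new IndexWriter(tempDir, analyzer, true);
    for (int i = 0; i < hits.length(); i++) {
        tempWriter.addDocument(hits.doc(i));
    }
    tempWriter.close();

    // ...and expand the wildcard terms against that instead of the full index
    IndexReader tempReader = IndexReader.open(tempDir);
    Query expanded = query.rewrite(tempReader);
    tempReader.close();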

Re: How to extract matching terms for a document given a query

2004-06-16 Thread markharw00d
>>The reason the current highlighter is not suitable for me is that the >>content of the document is not stored in the index That shouldn't present a problem. The working code example below was from a recent email discussion I had with someone who was storing text in a database. This simple exam
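A sketch of the pattern that (truncated) example showed, with the text pulled from outside the index; the JDBC ResultSet and "body" column name are hypothetical:

    String text = resultSet.getString("body");  // text lives in the database
    TokenStream tokenStream =
        analyzer.tokenStream("body", new StringReader(text));
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    String fragment = highlighter.getBestFragment(tokenStream, text);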

Re: Distributed searches and RAM Dir

2004-06-04 Thread markharw00d
>> Look up Mark Harwood and Lucene. ...provided some nice sequential >>UML diagrams with notes Those notes went missing recently when the ISP canned my free account. I've resurrected them at my new site here: http://www.inperspective.com/lucene/distrib/index.htm Cheers Mark

Re: Tool for analyzing analyzers

2004-05-27 Thread markharw00d
Hi Erik, I've had this running OK from the command line and in Eclipse on XP. I suspect it might be because you're running a different OS. The "Classfinder" tries to split the system property "java.class.path" on the ";" character but I forgot different OSes have different separators. As for Lu
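One portable fix (not necessarily what ended up in the tool) is to split on the platform's own separator rather than a hard-coded ";":

    String[] entries =
        System.getProperty("java.class.path").split(File.pathSeparator);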

Tool for analyzing analyzers

2004-05-27 Thread markharw00d
I've knocked together this tool which automatically discovers Analyzers on the classpath and provides a GUI to allow you to try out different Analyzers and see their effects: http://www.inperspective.com/lucene/Viewer.zip This needs JDK1.4 and you'll need to define the classpath to include Luce

RE: org.apache.lucene.search.highlight.Highlighter

2004-05-25 Thread markharw00d
>>If the Content is Stored as... >>doc.add(Field.Text("contents", reader)); That's just it. It's not stored: see the javadocs for Field.Text(String, Reader): "Constructs a Reader-valued Field that is tokenized and indexed, but is not stored in the index" As opposed to: Field.Text(String name,
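The two calls side by side (a sketch against the Lucene 1.4-era Field API):

    // tokenized and indexed but NOT stored - nothing to retrieve for highlighting
    doc.add(Field.Text("contents", reader));

    // tokenized, indexed AND stored - the text can be read back later
    doc.add(Field.Text("contents", contentString));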

Re: Possible to fetch a document without all fields for performance?

2004-05-22 Thread markharw00d
I've put together some code to do this based on this API: Document document(int docNum, String [] fieldNames); You can now be selective about which fields you want to read off disk. It does offer some speed-ups but it is not as fast as it could be due to a limitation in the index file forma

Re: org.apache.lucene.search.highlight.Highlighter

2004-05-21 Thread markharw00d
Hi Claude, that example code you provided is out of date. For all concerned - the highlighter code was refactored about a month ago and then moved into the Sandbox. Want the latest version? Get the latest code from the sandbox CVS. Want the latest docs? Run javadoc on the above. There is a

Re: org.apache.lucene.search.highlight.Highlighter

2004-05-19 Thread markharw00d
>>Was investigating, found some compile-time errors.. I see the code you have is taken from the example in the javadocs. Unfortunately that example wasn't complete because the class didn't include the method defined in the Formatter interface. I have updated the Javadocs to correct this oversight.

Re: Highlighter package v2 RC1

2004-04-10 Thread markharw00d
>>Can I customize the way it highlights terms? Right now it does so by surrounding >>with <B></B>. That's the job of a Formatter class. You can pass one in the constructor, e.g.: Formatter myFormatter=new SimpleHTMLFormatter("",""); Highlighter h=new Highlighter(myFormatter, new QueryScorer(query)); If
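For example, to wrap hits in something other than the default bold tags (the span markup here is just an illustration):

    Formatter formatter = new SimpleHTMLFormatter("<span class=\"hit\">", "</span>");
    Highlighter highlighter = new Highlighter(formatter, new QueryScorer(query));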

Highlighter package v2 RC1

2004-04-08 Thread markharw00d
I've reworked the highlighter package to address some issues (inability to pass fieldnames to analyzers, limiting tokenization of large docs) and have refactored it to be more modular so that folks can provide alternative implementations of the main functions (tokenizing, fragmenting and scoring

re: Highlight package

2004-04-06 Thread markharw00d
Just got back from holiday to find that one of the ISPs I use shut down my site with the highlighter code, and a number of people have complained about the broken "highlighter" link from the Lucene site. I have temporarily put the missing highlighter code up here: http://www.inperspective.com/luc

Re: Performance of hit highlighting and finding term positions for

2004-04-01 Thread markharw00d
730 msecs is the correct number for 10 * 16k docs with StandardTokenizer! The 11ms per doc figure in my post was for highlighting using a lower-case-filter-only analyzer. 5ms of this figure was the cost of the lower-case-filter-only analyzer. 73 msecs is the cost of JUST StandardTokenizer

Re: Performance of hit highlighting and finding term positions for

2004-03-31 Thread markharw00d
>>Folks have benchmarked this, and, for documents less than 10k characters or so, >>re-tokenizing is fast enough. As a note of warning: I did find StandardTokenizer to be the major culprit in my tokenizing benchmarks (avg 75ms for 16k sized docs). I have found I can live without StandardTokenize

Re: Demoting results

2004-03-29 Thread markharw00d
>>You could, if you fail to find any fragments that match the entire >>query, re-query the fragments with a flattened query containing just an >>OR of all of the original query terms. The other issue with this approach I'm still struggling with is simply the cost of creating the temporary index

Re: Demoting results

2004-03-29 Thread markharw00d
Hi Doug, Thanks for the post. BoostingQuery looks to be cleaner, faster and more generally useful than my implementation :-) Unless anyone has a particularly good reason to keep it, I'll remove the link to my code that Stephane put on the Wiki contributions page. I definitely find BoostingQuery very useful

Re: Demoting results

2004-03-28 Thread markharw00d
I've found an elegant way of doing this now for all types of search - a new "NegatingQuery" class that takes any Query object in its constructor, selects all documents that DON'T match and gives them a user-definable boost. The code is here: http://www.inperspective.com/lucene/demote.zip Chee

Re: Demoting results

2004-03-26 Thread markharw00d
I have not been able to work out how to get custom coordination going to demote results based on a specific term, but I have an alternative suggestion that looks like it might work: I've created a "MissingTermQuery" - which is the opposite of a TermQuery and can be used to boost documents that DON'T

More Like This Query updated plus benchmarks

2004-02-29 Thread markharw00d
I have updated the MoreLikeThis query generator to address a few issues. The code is available here: http://home.clara.net/markharwood/lucene/MoreLikeThis.java I have added comments at the top of the class to describe the changes. I was interested in the benefits of the new TermVector code so I be

RE: Lucene scalability/clustering

2004-02-26 Thread markharw00d
I tend to think of scaling in two dimensions: scaling by volumes of users and scaling by volumes of data. The former is addressed through replicated indexes and the latter by segmented indexes. Distribute replicated segments across multiple boxes and create a broker which a) determines which segm
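The "segmented" half of that picture maps onto Lucene's MultiSearcher; a minimal single-box sketch (the index paths are placeholders, and a real broker would wrap remote Searchables rather than local ones):

    Searchable[] segments = new Searchable[] {
        new IndexSearcher("/indexes/segment0"),
        new IndexSearcher("/indexes/segment1")
    };
    Searcher broker = new MultiSearcher(segments);
    Hits hits = broker.search(query);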

Problem using highlighter package

2004-02-18 Thread markharw00d
Hi Alex. Looks to me like you have a classpath problem - you're running with a version other than 1.3 final. Earlier versions of Lucene didn't have the 2 methods in your error messages. You'll need to check your classpath settings carefully. >>I downloaded the highlighter package made available b

Re: MoreLikeThis Query generator - Re: code for "more like this"

2004-02-17 Thread markharw00d
Here are the results of some tests using David's "more like.." class: http://home.clara.net/markharwood/lucene/mlt.htm Looks useful. I have a couple of suggestions in the review. Cheers Mark

RE: Lucene 1.2 "Hit Highlighting"

2003-12-07 Thread markharw00d
Hi Ken, I've just had a look at the compatibility issues of my Highlighter package and Lucene 1.2. It looks like the following Lucene methods are not present in this version: BooleanQuery.getClauses(), PhraseQuery.getTerms(), TermQuery.getTerm() and PriorityQueue.insert(). However, if you

New highlighter package available

2003-09-24 Thread markharw00d
Details of a new highlighter package are available here: http://home.clara.net/markharwood/lucene/highlight.htm Features include: * Support for highlighting all query types * Support for getting a "best fragments" summary from large docs * Works with the latest version of Lucene Hope you find this use