Lucene cuts the search results?
Hi all, I'm quite a newbie to Lucene, but I bought Lucene in Action and I'm trying to customize a few examples taken from there. Here's my sample JSP (bad JSP, because I'm also a JSP newbie :-)):

    <html>
    <head></head>
    <body>
    <%
    long start = new Date().getTime();
    Iterator myIterator = vIndexDir.iterator();
    while (myIterator.hasNext()) {
        // read the directory name once; calling next() twice per loop skips entries
        String indexDir = (String) myIterator.next();
        IndexSearcher searcher = new IndexSearcher(indexDir);
        Query query = new TermQuery(new Term("introduction", queryString));
        Hits hits = searcher.search(query);
        QueryScorer scorer = new QueryScorer(query);
        Highlighter highlighter = new Highlighter(scorer);
    %>
    <table width="70%" cellpadding="2" cellspacing="2">
    <%
        out.println("<tr><td><hr/><br/>NUMBER OF MATCHING NEWS FOR \"" + indexDir
                + "\" -- " + hits.length() + "</td></tr>");
        for (int i = 0; i < hits.length(); i++) {
            String introduction = hits.doc(i).get("introduction");
            TokenStream stream = new SimpleAnalyzer().tokenStream("introduction",
                    new StringReader(introduction));
            String fragment = highlighter.getBestFragment(stream, introduction);
            String pubDate = hits.doc(i).get("pubDate").substring(0,
                    hits.doc(i).get("pubDate").length() - 13);
            String link = hits.doc(i).get("link");
            float score = hits.score(i);
            String title = hits.doc(i).get("title");
    %>
    <tr>
      <td>
        Scoring : <b><%= score %></b><br/>
        <%= pubDate %> <a href="#" onClick="window.open('<%= link %>', 'news', 'width=760,height=600')"><%= title %></a><br/>
        <%= fragment %><br/><br/>
      </td>
    </tr>
    <% } %>
    </table>
    <% }
    long end = new Date().getTime();
    long interval = end - start;
    %>
    <br/><br/><div align="right"><b>System time for query : <%= interval %> milliseconds</b></div>
    </body>
    </html>

The output is all right, but at the end of the result page the last hit is cut off, for example:

    Scoring : 0.9210043
    Fri, 28 Jan 2005 -

I'm running all this on Tomcat 5.0.28 and a fresh nightly build of Lucene. Could it be a caching problem? Could this come from JSP or from Lucene?
Thanks, and please excuse my poor English ;-) Pierre VANNIER - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Lucene on PersonalJava?? HELP!
Hi, has anybody here run Lucene 1.3 or 1.2 under PersonalJava (equivalent to JDK 1.1)? I have a friend who runs Lucene 1.3 under PersonalJava and it works. Mine doesn't. When comparing the code I cannot find any difference. I search the index with a Query and get an error saying that the method java.io.File.createNewFile() is used in Lucene. I have checked Java 1.1.8 and indeed this method does not exist there. Besides the question of how it can work on my friend's system with the same code, I have two more questions: 1) Has anybody here used Lucene on a PDA under PersonalJava and can share some experience? 2) Is there anything else I should try or something I have forgotten? Thanks for your help, Karl
Re: Lucene on PersonalJava?? HELP!
On Tue, 2005-02-15 at 14:05 +0100, Karl Koch wrote: "I get an error saying that the method java.io.File.createNewFile() is used in Lucene. I have checked Java 1.1.8 and indeed this method does not exist." It might be the constructor of the IndexReader or IndexSearcher that you're using. You can pass in a string that points to the directory, or a File object instead. Lucene might be using java.io.File.createNewFile() only when you pass in a string. A simple grep should find out where it's being used. -- Miles Barr [EMAIL PROTECTED] Runtime Collective Ltd.
Re: Lucene cuts the search results?
On Tuesday 15 February 2005 09:39, Pierre VANNIER wrote: String fragment = highlighter.getBestFragment(stream, introduction); The highlighter breaks text up into same-size chunks (100 characters by default). If the matching term appears right at the end or at the start of such a chunk you'll get no context, and it looks as if the text was cut off. Regards, Daniel
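The effect Daniel describes can be reproduced with a toy fixed-size fragmenter (a sketch only: the class and method names below are made up, and the real highlighter works on a TokenStream rather than raw strings). Chunk boundaries are chosen without looking at where the match falls, so a term sitting at the very edge of a chunk comes back with almost no context.

```java
// Toy fixed-size fragmenter: splits text into fixed-size chunks and returns
// the chunk containing the query term -- a simplified stand-in for the
// highlighter's default behavior (names here are illustrative only).
public class ChunkDemo {
    static String bestFragment(String text, String term, int fragSize) {
        int pos = text.indexOf(term);
        if (pos < 0) return "";
        int chunkStart = (pos / fragSize) * fragSize;  // boundary ignores the match position
        int chunkEnd = Math.min(chunkStart + fragSize, text.length());
        return text.substring(chunkStart, chunkEnd);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 98; i++) sb.append('x');
        sb.append(" lucene is a search library");
        // "lucene" starts at offset 99, so the chunk [0, 100) ends just as the term begins:
        System.out.println(bestFragment(sb.toString(), "lucene", 100));
    }
}
```

The fragment comes back as 98 filler characters plus the single letter "l": the match is visible but cut off, which is exactly the symptom in Pierre's last hit.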
Re: Lucene cuts the search results?
Thanks for the reply, Daniel. But is there anything I can do to stop this from happening? Regards, Pierre. Daniel Naber wrote: "The highlighter breaks up text into same-size chunks (100 characters by default). If the matching term now appears just at the end or at the start of such a chunk you'll get no context and it looks as if text was cut off."
Re: Lucene on PersonalJava?? HELP!
Hello, thank you for the tip. I have solved the problem in a different way; if anybody else wants to run Lucene on PJava, they might go for the same: I am using the cvm VM instead of the jeode VM. With it, Lucene 1.2 works fine without any change in my code or in the Lucene code. Perhaps it even works with a newer version (but I haven't tested that yet). :-) Thank you anyway, Karl. Miles Barr wrote: "It might be the constructor of the IndexReader or IndexSearcher that you're using. You can pass in a string that points to the directory, or a File object instead. Lucene might be using java.io.File.createNewFile() if you pass in a string. A simple grep should find out where it's being used."
Re: Lucene cuts the search results?
Hi Pierre, Here's the response I gave the last time this question was raised: The highlighter uses a number of pluggable services, one of which is the choice of Fragmenter implementation. This interface is for classes which decide the boundaries where to cut the original text into snippets. The default implementation simply breaks the text into evenly sized chunks. A more intelligent implementation could be made to detect sentence boundaries. What you are asking for requires that the Fragmenter know where the upcoming query matches are and decide on fragment boundaries with this in mind. To have this foresight it would need a preliminary pass over the TokenStream to identify the match points before calling the highlighter. This Fragmenter implementation does not exist, but it does not sound unachievable. I would suggest that some knowledge of sentence boundaries would probably help here too. I don't have any plans to write such a Fragmenter now, but this is how it could be done. Hope this helps, Cheers, Mark
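A sentence-aware fragmenter along the lines Mark describes might look like the following sketch. It works on plain strings rather than Lucene's TokenStream and Fragmenter interfaces, the class and method names are made up for illustration, and the sentence split is deliberately naive; the point is only the cutting strategy: accumulate whole sentences until the fragment would exceed the target size, then start a new fragment, so no snippet ends mid-sentence.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of sentence-boundary fragmentation. A real Fragmenter implementation
// would work on token offsets and detect boundaries more robustly.
public class SentenceFragmenter {
    static List<String> fragment(String text, int targetSize) {
        List<String> fragments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Naive sentence split: break after a period followed by whitespace.
        for (String sentence : text.split("(?<=\\.)\\s+")) {
            if (current.length() > 0 && current.length() + sentence.length() > targetSize) {
                fragments.add(current.toString().trim());  // close fragment at sentence edge
                current.setLength(0);
            }
            current.append(sentence).append(' ');
        }
        if (current.length() > 0) fragments.add(current.toString().trim());
        return fragments;
    }

    public static void main(String[] args) {
        String text = "Lucene is a search library. It indexes documents. "
                + "The highlighter marks query terms in stored text.";
        for (String f : fragment(text, 60)) System.out.println(f);
    }
}
```

Fragments may overshoot or undershoot the target size, but every boundary falls between sentences, so a match near a boundary still carries its full sentence of context.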
Field Information from the Index
Hello, I have two questions which might be easy to answer for a Lucene expert: 1) I need to know which fields a collection of Documents has (given that not all documents necessarily use all fields). These documents are all stored in one index. Is there a way (with Lucene 1.2 or 1.3) to find this out without going through each document and retrieving it? 2) I need to know which Analyzer was used to index a field. One important rule, as we all know, is to use the same analyzer for indexing and searching a field. Is this information stored in the index, or is it fully the responsibility of the application developer? Karl
Re: Opening up one large index takes 940M of memory?
Doug Cutting wrote: Kevin A. Burton wrote: "Is there any way to reduce this footprint? The index is fully optimized... I'm willing to take a performance hit if necessary. Is this documented anywhere?" You can increase TermInfosWriter.indexInterval. You'll need to re-write the .tii file for this to take effect. The simplest way to do this is to use IndexWriter.addIndexes(), adding your index to a new, empty directory. This will of course take a while for a 60GB index... (Note: when this works I'll record my findings on a wiki page for future developers.) Two more questions: 1. Do I have to do this with a NEW directory? Our nightly index merger uses an existing target index, which I assume will re-use the same settings as before. I did this last night and it still seems to use the same amount of memory. Above you assert that I should use a new, empty directory, so I'll try that tonight. 2. This isn't destructive, is it? I mean, I'll be able to move BACK to a TermInfosWriter.indexInterval of 128, right? Thanks! Kevin -- Kevin A. Burton, San Francisco, CA, http://peerfear.org/
Re: Opening up one large index takes 940M of memory?
Kevin A. Burton wrote: "1. Do I have to do this with a NEW directory? Our nightly index merger uses an existing target index which I assume will re-use the same settings as before. I did this last night and it still seems to use the same amount of memory." You need to re-write the entire index using a modified TermInfosWriter.java. Optimize rewrites the entire index but is destructive; merging into a new, empty directory is a non-destructive way to do this. "2. This isn't destructive is it? I mean I'll be able to move BACK to a TermInfosWriter.indexInterval of 128 right?" Yes, you can go back if you re-optimize or re-merge again. Also, there's no need to CC my personal email address. Doug
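The memory saving Doug is pointing at can be estimated with back-of-the-envelope arithmetic (a sketch only: the term count and bytes-per-entry figures below are assumptions for illustration, not measurements of Lucene's actual per-entry overhead). The in-memory .tii term index holds roughly one entry per indexInterval terms, so raising the interval shrinks the resident entry count proportionally.

```java
// Rough estimate of term-index RAM: entries ~= totalTerms / indexInterval,
// each costing some fixed number of bytes. bytesPerEntry is an illustrative
// assumption, not a figure taken from Lucene's source.
public class TiiEstimate {
    static long estimateBytes(long totalTerms, int indexInterval, int bytesPerEntry) {
        return (totalTerms / indexInterval) * bytesPerEntry;
    }

    public static void main(String[] args) {
        long totalTerms = 100_000_000L;  // hypothetical term count for a large index
        System.out.println(estimateBytes(totalTerms, 128, 64));   // default interval
        System.out.println(estimateBytes(totalTerms, 1024, 64));  // 8x larger interval
    }
}
```

Under these assumed numbers, going from an interval of 128 to 1024 cuts the estimate roughly eightfold; the price is a longer linear scan between index entries on each term lookup.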
Re: Field Information from the Index
On Feb 15, 2005, at 11:45 AM, Karl Koch wrote: "2) I need to know which Analyzer was used to index a field. One important rule, as we all know, is to use the same analyzer for indexing and searching a field. Is this information stored in the index or is it fully the responsibility of the application developer?" The analyzer is not stored in the index, nor is its name. I believe this was discussed in the past, though. It's not a rule that the same analyzer be used for both indexing and searching; there are cases where it makes sense to use different ones. The analyzers must be compatible, though. Erik
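Erik's "compatible but different" point can be sketched with plain string pipelines (a hypothetical illustration with no Lucene types; real analyzers emit TokenStreams, and the synonym example is an assumption, not a standard Lucene analyzer). An index-time analyzer might inject synonyms while the query-time analyzer does not; as long as both lowercase the same way, query tokens still line up with what was indexed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy analyzers: both lowercase (that's what makes them compatible), but only
// the index-time one expands a synonym, so either surface form can match.
public class AnalyzerDemo {
    static List<String> indexAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            tokens.add(t);
            if (t.equals("car")) tokens.add("automobile");  // synonym injected at index time only
        }
        return tokens;
    }

    static List<String> queryAnalyze(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));  // no synonym expansion
    }

    public static void main(String[] args) {
        List<String> indexed = indexAnalyze("Red Car");
        // Both query forms find their tokens among the indexed ones:
        System.out.println(indexed.containsAll(queryAnalyze("car")));
        System.out.println(indexed.containsAll(queryAnalyze("automobile")));
    }
}
```

If the two pipelines disagreed on something basic (say, one lowercased and the other did not), queries would silently miss: that is the incompatibility Erik warns about.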
Re: Multiple Keywords/Keyphrases fields
From: Erik Hatcher [EMAIL PROTECTED] Date: February 12, 2005 3:09:15 PM MST To: Lucene Users List lucene-user@jakarta.apache.org Subject: Re: Multiple Keywords/Keyphrases fields The real question to answer is what types of queries you're planning on making. Rather than look at it from indexing forward, consider it from searching backwards. How will users query using those keyword phrases? Hi Erik, good point. There are two uses we are making of the keyphrases:
- Graphical navigation: a Flash graphical browser will let users fly around in a space of documents, choosing what to view: authors, keyphrases, and textual terms. In each case, the closeness of the fields will govern how close they appear graphically. For authors, we will weight collaboration (how often the authors work together). For keyphrases, we will want to use distance vectors, as you show in the book, with the cosine measure. Thus the keyphrases need to be separate entities within the document; it would be a bug for us if terms leaked across the separate keyphrases within a document.
- Textual search: here we will have two ways to search the keyphrases. The first is like the graphical navigation above, where searching for "complex system" should require both terms to be in a single keyphrase. The second is looser, where we may simply pool the keyphrases with titles and abstracts and allow them all to be searched together within the document.
Does this make sense? So the question from the search standpoint is: do multiple instances of a field act like there are barriers across the instances, or are they treated as a single instance? In terms of the closeness calculation, for example, can we get separate term vectors for each instance of the keyphrase field, or will we get a single vector combining all the keyphrase terms within a single document? I hope this is clear!
Kinda hard to articulate. Owen. Erik: On Feb 12, 2005, at 3:08 PM, Owen Densmore wrote: I'm getting a bit more serious about the final form of our Lucene index. Each document has DocNumber, Authors, Title, Abstract, and Keywords. By Keywords, I mean a comma-separated list, each entry having possibly many terms in a phrase, like: temporal infomax, finite state automata, Markov chains, conditional entropy, neural information processing. I presume I should be using a Keywords field which has many entries or instances per document (one per comma-separated phrase), but I'm not sure of the right way to handle all this. My assumption is that I should analyze them individually, just as we do for free text (the Abstract, for example), thus in the example above making 5 calls of the form doc.add(Field.Text("Keywords", "finite state automata")); etc., analyzing them because these are author-supplied strings with no canonical form. For guidance, I looked in the archive and found the attached email, but I didn't see the answer. (I'm not concerned about the dups; I presume that is equivalent to a boost of some sort.) Does this seem right? Thanks once again. Owen From: [EMAIL PROTECTED] Subject: Multiple equal Fields? Date: Tue, 17 Feb 2004 12:47:58 +0100 Hi! What happens if I do this:
    doc.add(Field.Text("foo", "bar"));
    doc.add(Field.Text("foo", "blah"));
Is there a field foo with value blah, or are there two foos (actually not possible), or is there one foo with the values bar and blah? And what happens in this case:
    doc.add(Field.Text("foo", "bar"));
    doc.add(Field.Text("foo", "bar"));
    doc.add(Field.Text("foo", "bar"));
Does Lucene store this only once? Timo
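One way to picture the multi-valued field question is with a toy model (illustrative only, and an assumption about semantics rather than Lucene source: multi-valued fields are commonly described as keeping each stored value separate while the indexed tokens of all same-named fields are effectively concatenated into one logical field that search sees).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of a multi-valued field: each add() keeps its own stored value,
// while the indexed token stream is the concatenation of all values.
// (Real Lucene works through Document/Field objects, not this class.)
public class MultiFieldModel {
    private final List<String> storedValues = new ArrayList<>();

    void add(String value) { storedValues.add(value); }

    List<String> stored() { return storedValues; }  // separate retrievable values

    List<String> indexedTokens() {                  // what search effectively sees
        List<String> tokens = new ArrayList<>();
        for (String v : storedValues) tokens.addAll(Arrays.asList(v.split("\\s+")));
        return tokens;
    }

    public static void main(String[] args) {
        MultiFieldModel keywords = new MultiFieldModel();
        keywords.add("finite state automata");
        keywords.add("markov chains");
        System.out.println(keywords.stored());         // two separate keyphrases
        System.out.println(keywords.indexedTokens());  // one combined token list
    }
}
```

Under this model, Owen's worry is visible: with plain concatenation a phrase query could match across the keyphrase boundary ("automata markov" here), unless something like a position gap is inserted between instances at index time.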