inter-term correlation [was Re: Vector Space Model in Lucene?]
Incorporating inter-term correlation into Lucene isn't that hard; I've done it. Nor is it incompatible with the vector-space model. I'm not happy with the specific correlation metric I picked, which is why I'm not eager to release the code generally, but I think the basic mechanism I came up with (query expansion via correlated terms, where the added terms are boosted according to the strength of the correlation) is fairly sound. And I didn't need any changes to Lucene to do this.

You can get some details on the specific mechanism I used here, if you're interested: http://www.ics.uci.edu/~jmadden/research/index.html (go down to "Fuzzy Term Expansion and Document Reweighting", about halfway down). If you decide that my ideas are interesting enough that you want to have a look at my code, let me know, and perhaps we can work something out.

Regards, Joshua O'Madadhain

On Friday, Nov 14, 2003, at 09:52 US/Pacific, Chong, Herb wrote:

I don't know of any open-source search engine that incorporates inter-term correlation. I have been looking into how to do this in Lucene, and so far it's not been promising: the indexing engine and file format need to be changed. There are very few search engines that incorporate inter-term correlation in any mathematically and linguistically rigorous manner. I designed a couple, but they were all research experiments. Are you familiar with the TREC automatic ad hoc track? My experiments with the TREC-5 to TREC-7 questions produced about a 0.05 to 0.10 improvement in average precision through proper use of inter-term correlation. My project at the time was cancelled after TREC-7, so there haven't been any new developments.

[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
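The expansion mechanism described above (add correlated terms to the query, boosted by correlation strength) can be sketched without touching Lucene's internals, since Lucene's query syntax already carries per-term boosts as "term^boost". The correlation table below is a hypothetical stand-in for whatever metric you compute offline; the exact terms and weights are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: expand each query term with its correlated terms, carrying the
// correlation strength as a Lucene-style boost ("term^0.6"). The table
// of correlations is assumed to be precomputed elsewhere.
public class CorrelationExpander {

    // term -> (correlated term -> correlation strength in [0,1])
    private final Map<String, Map<String, Double>> correlations;

    public CorrelationExpander(Map<String, Map<String, Double>> correlations) {
        this.correlations = correlations;
    }

    // Returns the expanded query string, original terms unboosted.
    public String expand(String query) {
        StringBuilder out = new StringBuilder();
        for (String term : query.trim().split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            out.append(term);
            Map<String, Double> related = correlations.get(term);
            if (related == null) continue;
            for (Map.Entry<String, Double> e : related.entrySet()) {
                out.append(' ').append(e.getKey()).append('^').append(e.getValue());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> table = new LinkedHashMap<>();
        Map<String, Double> jazz = new LinkedHashMap<>();
        jazz.put("blues", 0.6);
        table.put("jazz", jazz);
        CorrelationExpander ex = new CorrelationExpander(table);
        System.out.println(ex.expand("jazz guitar"));  // jazz blues^0.6 guitar
    }
}
```

The resulting string can be fed to QueryParser, or the same loop can build TermQuery objects directly with setBoost.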
Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
Not sure what you mean by "terms can't cross sentence boundaries". If you're only using single-word terms, that's trivially true. What is it that you're trying to achieve, exactly? (Your comment makes it sound as though you simultaneously want and don't want sentence boundaries, so I'm confused.)

Joshua

On Friday, Nov 14, 2003, at 10:13 US/Pacific, Chong, Herb wrote:

If you didn't have to change the index, then you haven't got all the factors needed to do it well. Terms can't cross sentence boundaries, and the index doesn't store sentence boundaries.

Herb...
Re: Document Clustering
On Tuesday, Nov 11, 2003, at 11:05 US/Pacific, Marcel Stor wrote:

Stefan Groschupf wrote: How is document clustering different from/related to text categorization?

Clustering: try to find your own categories and put documents that match into them. You group all documents with minimal distance together.

Would I be correct to say that you have to define a distance threshold parameter in order to decide when to build a new category for a certain group?

Depends on the type of clustering algorithm. Some clustering algorithms take the number of clusters as a parameter (in this case the algorithm may be run several times with different values, to determine the best value). Other types of algorithms, such as hierarchical agglomerative clustering algorithms, work more as you suggest.

Regards, Joshua O'Madadhain
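The threshold-driven style of clustering mentioned above can be sketched in a few lines. This is a toy: it greedily assigns each point to an existing cluster if it lies within the threshold of some member, and otherwise starts a new cluster. Real document clustering would use a vector-space distance (e.g. cosine over term vectors) rather than 1-D points.

```java
import java.util.ArrayList;
import java.util.List;

// Toy threshold clustering: a new "category" is created whenever a point
// is farther than `threshold` from every member of every existing cluster.
public class ThresholdClusterer {

    public static List<List<Double>> cluster(double[] points, double threshold) {
        List<List<Double>> clusters = new ArrayList<>();
        for (double p : points) {
            List<Double> home = null;
            for (List<Double> c : clusters) {
                for (double member : c) {
                    if (Math.abs(member - p) <= threshold) { home = c; break; }
                }
                if (home != null) break;
            }
            if (home == null) { home = new ArrayList<>(); clusters.add(home); }
            home.add(p);
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[] pts = {1.0, 1.2, 5.0, 5.1, 9.0};
        System.out.println(cluster(pts, 0.5).size());  // 3 clusters
    }
}
```

By contrast, a k-means-style algorithm would take the number of clusters as input instead of the threshold, as noted above.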
Re: Empty phrase search
Minh: Assuming that the fields aren't binary (and thus capable of having any arbitrary string in them), you should be able to come up with a short string that will never appear in the field ("emptystring", "xyxyqu23", etc.), even if there is no single character that would work as a marker. As an extra layer of insurance, you could throw out any documents whose field contains that string only as a substring. This may not be completely bulletproof, but it's pretty close. :)

Joshua

On Tue, 17 Dec 2002, Minh Kama Yie wrote:

Yep, I thought of that, and having argued it out with the lead on this, it would be suitable in my opinion. But even I would concede that it is a hack in our case, since there is no limit to the variety of characters that could appear as the value of a field which may be an empty string. Hence there isn't a 100% guarantee that the special character or string of characters we choose to represent empty fields will never appear as a valid value. Thanks anyway though. Minh

----- Original Message -----
From: Joshua O'Madadhain [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, December 17, 2002 6:03 PM
Subject: Re: Empty phrase search

Minh: Why not just use a special character or string (one that won't appear in a non-empty field) to represent an empty field? It doesn't have to appear in the user interface, if that's a concern; you could convert a query including FieldA:NULL (or whatever) into a query containing FieldA:emptyfield automatically. This allows you to finesse the entire issue of adding something to Lucene--which may be for the best anyway, since this is really just a special case of looking for fields whose contents have a specific characteristic. Good luck-- Joshua O'Madadhain

On Tue, 17 Dec 2002, Minh Kama Yie wrote:

Thanks for that, Peter. Unfortunately I'm not looking for all documents, but rather documents where the fields can be empty; hence using the universal field wouldn't quite work. The empty:true approach is interesting; however, it effectively doubles the number of indexable fields for a document. I need to assess whether or not we want to support this feature, I guess. Considering Lucene's architecture, I think this feature needs to be inherently supported, rather than worked around with various fields, for it to be done properly... Would anyone be able to point me in the general direction of where to look in the Lucene code to attempt this? Hopefully this way I can give a proper cost/benefit analysis as to whether or not to support this feature... and if all goes well, release something back for the great work you guys do.

----- Original Message -----
From: Peter Carlson [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, December 17, 2002 4:25 PM
Subject: Re: Empty phrase search

I don't think so. One approach to look for everything, or for the absence of something, is to add a field to each document with a constant value: say, a field named "exists" with a value of "true". Then you can do a search like exists:true NOT microsoft, which will find all documents without the term "microsoft" in them. Getting it to find documents with nothing in them might be a little tricky. You might want to put a field in the document which indicates the size, or something like that. Or just create an empty field and look for empty:true. I hope this rambling helps. --Peter

On Monday, December 16, 2002, at 03:24 PM, Minh Kama Yie wrote:

Hi guys, just wondering if Lucene indexes empty strings, and if so, how to search for them using the query language? Regards, Minh Kama Yie

This message is intended only for the named recipient. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.
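The sentinel-token idea in this thread has two halves: substitute a marker for the empty value at index time, and rewrite user-facing NULL queries to search for it. A minimal sketch, where "xemptyfieldx" is an arbitrary token assumed never to occur as a real value:

```java
// Sentinel-token approach to searching for empty fields.
// SENTINEL is a made-up token chosen to never appear as real field data.
public class EmptyFieldSentinel {

    static final String SENTINEL = "xemptyfieldx";

    // At index time: substitute the sentinel for an empty (or null) value.
    public static String valueForIndex(String raw) {
        return (raw == null || raw.trim().isEmpty()) ? SENTINEL : raw;
    }

    // At query time: turn the user's "field:NULL" into "field:xemptyfieldx".
    public static String rewriteQuery(String query) {
        return query.replace(":NULL", ":" + SENTINEL);
    }

    public static void main(String[] args) {
        System.out.println(valueForIndex(""));               // xemptyfieldx
        System.out.println(rewriteQuery("FieldA:NULL dog")); // FieldA:xemptyfieldx dog
    }
}
```

As Minh notes, this is a hack with no absolute guarantee; the "extra layer of insurance" would be a post-filter that drops hits where the sentinel appears only as a substring of a longer value.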
Re: Accentuated characters
On Tue, 10 Dec 2002, stephane vaucher wrote:

I wish to implement a TokenFilter that will remove accented characters, so that for example 'é' becomes 'e'. As I would rather not reinvent the wheel, I've tried to find something on the web and on the mailing lists. I saw a mention of a contrib that could do this (see http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html), but I don't see anything applicable.

It may depend on what kind of encoding you're working with. (E.g., HTML documents represent such characters differently than Postscript documents do.) Probably the easiest way to handle this, if you want to avoid such questions, would be to convert all your input documents (and query text) to Java (Unicode) strings, and then do a search-and-replace with the appropriate character-pair arguments. After this is done, you would do whatever Lucene processing (indexing, query parsing, etc.) was appropriate. I am not aware of any code that does this, but it should be straightforward.

Good luck-- Joshua O'Madadhain
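Once the input is a Java (Unicode) string, as suggested above, one way to fold accents is to decompose to NFD form and strip the combining marks, rather than maintaining a character-pair table by hand. (java.text.Normalizer is a modern-JDK convenience; at the time of this thread you would have used an explicit character map instead.)

```java
import java.text.Normalizer;

// Accent folding via Unicode decomposition: NFD splits 'é' into
// 'e' + a combining acute accent, and \p{M} matches the combining marks.
public class AccentFolder {

    public static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("éàçü"));  // eacu
    }
}
```

This handles any precomposed Latin accent uniformly; characters with no decomposition (e.g. 'ø') would still need a small explicit table.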
Re: Lucene Speed under diff JVMs
On Thu, 5 Dec 2002, Armbrust, Daniel C. wrote:

I'm using the class that Otis wrote (see message from about 3 weeks ago) for testing the scalability of Lucene (more results on that later), and I first tried running it under different versions of Java to see where it runs the fastest. The class simply creates an index out of randomly generated documents. All of the following were run on a dual-CPU 1 GHz PIII Windows 2000 machine that wasn't doing much else during the benchmark. The indexing program was single-threaded, so it only used one of the machine's processors. [snip specific measurements] As you can see, the IBM JVM pretty much smoked Sun's, and beat out JRockit as well. Just a hunch, but it wouldn't surprise me if search times were also faster under the IBM JDK. Has anyone else come to this conclusion?

Just a brief note on performance measurements and statistical sampling: no offense, but if these are measurements of a single trial of 1000 documents for each JVM, they're not so different that I'd be willing to conclude that one JVM is notably faster for this task than another. The problem is compounded by the fact that it can be hard to tell just how much CPU is being taken up by OS tasks (and this can fluctuate quite a lot). If you really want to quote statistics like this, using 5 or 10 trials would give a more accurate notion of the real performance differences (if any).

Casuistically :), Joshua O'Madadhain
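The sampling point above, made concrete: report a mean and a spread over several trials rather than a single number, so that run-to-run noise (GC, OS tasks) is visible in the result. The timings in main are hypothetical placeholders.

```java
// Mean and sample standard deviation over repeated benchmark trials.
public class TrialStats {

    public static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    public static double stdDev(double[] xs) {
        double m = mean(xs), ss = 0;
        for (double x : xs) ss += (x - m) * (x - m);
        return Math.sqrt(ss / (xs.length - 1));  // sample (n-1) std deviation
    }

    public static void main(String[] args) {
        // Hypothetical indexing times (seconds) for 5 trials of one JVM.
        double[] indexTimesSec = {41.2, 39.8, 43.1, 40.5, 42.0};
        System.out.printf("%.1f +/- %.1f s%n",
                mean(indexTimesSec), stdDev(indexTimesSec));
    }
}
```

If the two JVMs' mean-plus/minus-deviation intervals overlap heavily, the single-trial difference was probably noise.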
Re: StandardFilter that works for French
On Thu, 21 Nov 2002, Konrad Scherer wrote:

In French there are six words (me, te, se, le/la, ne, de) where the "e" is replaced with an apostrophe when the following word starts with a vowel. For example, "me aider" becomes "m'aider". Currently Lucene indexes m'aider, s'aider, n'aider as different words, when in fact they should be analyzed as "me aider", "se aider", "ne aider", etc. So I modified StandardFilter to send back these words as two words; I had to add a one-token buffer. I toyed with modifying StandardTokenizer.jj, but I was worried about unintended changes in behavior. This change will not affect English indexing. The only change I can think of is that a word like "m'lord" would be indexed as "me lord". Still, it might be better to make a French package and add this to a French filter.

There are a number of contractions in English that could be affected if you're using the apostrophe as a marker, e.g.: isn't, wouldn't, I'd, he's, hasn't. (Granted, these are often considered stop words.) Thus, I think that your idea of incorporating this change into a French filter, rather than modifying StandardFilter, is a good one.

Joshua O'Madadhain
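The core of the elision handling Konrad describes, reduced to a sketch: split an elided form like "m'aider" back into the full pronoun plus the verb. The pronoun list covers the six words mentioned; "l'" is ambiguous between "le" and "la", and is expanded to "le" here purely by convention. A real implementation would live in a TokenFilter with the one-token buffer he mentions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Expand French elisions: "m'aider" -> ["me", "aider"].
// Tokens without a recognized elision pass through unchanged.
public class FrenchElision {

    private static final List<String> ELIDABLE =
            Arrays.asList("m", "t", "s", "l", "n", "d");

    public static List<String> split(String token) {
        int apos = token.indexOf('\'');
        List<String> out = new ArrayList<>();
        if (apos == 1 && ELIDABLE.contains(token.substring(0, 1))) {
            out.add(token.substring(0, 1) + "e");  // m -> me, l -> le, etc.
            out.add(token.substring(apos + 1));
        } else {
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("m'aider"));  // [me, aider]
        System.out.println(split("chien"));    // [chien]
    }
}
```

Note that this would also split English "m'lord" into "me lord", exactly the edge case discussed above, which is the argument for keeping it in a French-only filter.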
Re: Tags Screwing up Searches
On Mon, 21 Oct 2002, Terry Steichen wrote:

I discovered that the actual text I was dealing with had already had '<' converted to '&lt;', and so forth. So the problem is that with something like '&lt;b&gt;College Soccer&lt;/b&gt;', Lucene recognizes the trailing semi-colon ';' as a word separator, so it can find the term 'college', but it does not see the ending of 'soccer'. I did confirm that it *will* match on 'soccer&lt;' just fine. I've proceeded to add a string-substitution method which replaces '&lt;' with '    ' (four spaces, in order to hopefully keep the offsets straight). It appears to work, though I believe it slows down the indexing. I don't know enough about the inner design of Lucene to figure this out, but it seems logical that there would be a much more efficient way to handle this than string operations. PS: I've had no responses from the list, so perhaps this is a unique problem and doesn't justify a formal fix effort.

A few questions and comments; please pardon me if I am asking questions answered in previous email:

(1) Are you using an analyzer that is designed to handle (a) HTML, or (b) plain text?

(2) If (b), that's probably why you've been getting this kind of behavior, and you may want to look at the HTMLParser sample code in the distribution. The StandardAnalyzer, I'm pretty sure, is not designed to handle HTML.

(3) A quick-and-dirty solution for indexing HTML, if you are running on some flavor of Unix and don't want to figure out how to parse HTML tags: the text web browser lynx. lynx can dump the text of a web page as follows:

    lynx -dump -nolist foo.html > foo.txt

This effectively strips the HTML tags out of foo.html and writes the text of the page to the file foo.txt. Once you've done this, of course, you can use the same analyzers that you use for any unformatted text file.

Good luck-- Joshua O'Madadhain
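Terry's offset-preserving substitution (replace each entity with the same number of spaces, so term positions in the original text still line up) looks like this as a sketch; the entity list is a small illustrative subset:

```java
// Blank out HTML entities in place: each entity is overwritten with an
// equal number of spaces, so the string length and all offsets are kept.
public class EntityBlanker {

    private static final String[] ENTITIES = {"&lt;", "&gt;", "&amp;", "&quot;"};

    public static String blank(String text) {
        StringBuilder sb = new StringBuilder(text);
        for (String ent : ENTITIES) {
            int i;
            while ((i = sb.indexOf(ent)) >= 0) {
                for (int j = 0; j < ent.length(); j++) sb.setCharAt(i + j, ' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String in = "&lt;b&gt;College Soccer&lt;/b&gt;";
        String out = blank(in);
        System.out.println(out.length() == in.length());  // true: offsets preserved
    }
}
```

"&lt;" is exactly four characters, which is why four spaces keep the offsets straight, as noted above.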
Re: Query modifier ?
On Fri, 27 Sep 2002, Ganael LAPLANCHE wrote:

Does anyone know how to implement a query modifier with Lucene? I would like to be able to use a dictionary to modify an invalid query (just as Google does) and to suggest a new query to the user... This isn't really a Lucene-related question, but if someone could help me...

The easiest way--or, at least, the method I used--is to do your own query parsing:

(1) take the input from the user as a String;
(2) do whatever analysis or modifications on it that you like (e.g., breaking it down into words using a StringTokenizer and checking each word, by your favorite method, for misspellings);
(3) optionally let the user review your modifications (return to (1));
(4) create a Term from each word of the modified query;
(5) create a Query from your collection of Terms.

That's the basic idea. The details will depend on what kind of query you want to do, and whether you want to allow the user to specify Boolean modifiers, term boosts, etc. It may be possible to use the standard QueryParser to parse the query and then hack the Query that is returned, but I've never tried it.

Good luck-- Joshua O'Madadhain
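Steps (1), (2), and (4) above can be sketched as plain Java: tokenize the raw query, substitute known corrections, and emit the terms you would then wrap in Lucene Term/TermQuery objects. The correction table here is a hypothetical misspelling dictionary, stand-in for "your favorite method".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Query modifier sketch: tokenize, look each word up in a corrections
// dictionary, and return the (possibly modified) term list.
public class QueryModifier {

    private final Map<String, String> corrections;

    public QueryModifier(Map<String, String> corrections) {
        this.corrections = corrections;
    }

    public List<String> modify(String rawQuery) {
        List<String> terms = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(rawQuery);
        while (st.hasMoreTokens()) {
            String word = st.nextToken().toLowerCase();
            terms.add(corrections.getOrDefault(word, word));
        }
        return terms;
    }

    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<>();
        dict.put("speling", "spelling");  // hypothetical dictionary entry
        System.out.println(new QueryModifier(dict).modify("Speling errors"));
        // [spelling, errors]
    }
}
```

Step (3), letting the user review the suggestion, is just a UI loop around modify(); step (5) would fold the resulting terms into a BooleanQuery.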
Re: Comparing Intermedia and Lucene
On Wed, 25 Sep 2002, Peter Carlson wrote:

I have used Intermedia in the past and found that it had a few advantages and disadvantages compared to Lucene. Advantages of Intermedia: [snip] 4) Supports term expansion and contraction. Now, items 3 and 4 of Intermedia's advantages could be added as features to Lucene, but are not currently.

I'm not sure what you mean by "supports term expansion". In the IR community there are many different mechanisms for term expansion; perhaps in the database community the term is more restrictive. My understanding is that the only thing you need in order to be able to expand terms (or queries) is the ability to modify queries, which you can certainly do under Lucene. (A research project of mine uses Lucene as the core for a search engine that does query expansion, among other things.) Yes, I had to write some code to do this, but static term expansion, in which each term is expanded to a fixed list of other terms (which may be specifiable/modifiable by administrators or users), is pretty straightforward to code (my project used a considerably more sophisticated mechanism). If what you mean is that there is no specific API for term expansion in Lucene, that's true, but I'm not sure how much value such an API would add to Lucene.

Joshua O'Madadhain
RE: GoogleQueryParser
This issue confused me somewhat when I first encountered it from the other direction (looking at the inputs to, and behavior of, BooleanQuery). However, ultimately--as I understand it--what's going on here is that while the AND/OR syntax suggests Boolean queries, the indexing and searching engines use the vector model (which ranks documents by their score on the query) rather than the Boolean model (which returns all documents that satisfy the query). It may be that the Boolean syntax is more misleading than useful in some cases.

Part of the problem, too (as Peter pointed out), may be that the example queries under discussion are not well-formed Boolean queries, so it's not necessarily clear what their behavior *should* be. Let's consider the example: a b OR c. If I assume a default of AND, then a AND (b OR c) seems plausible; so does (a AND b) OR c; or perhaps (a AND c) OR (b AND c). I believe the vector-model interpretation of this query (with AND default) is +a +b c, which means: the more of the search terms 'a', 'b', 'c' a document has, the higher its score will be, all other things being equal; however, if a document doesn't have 'a', or doesn't have 'b', give it score 0 regardless of any other factors. In terms of the document *set* returned, this is exactly the Boolean query a AND b; the optional 'c' affects only the ranking, not which documents match.

Personally, I avoid using the QueryParser entirely and just do my own parsing and query construction. Part of the reason for this is that my code is doing term expansion and reweighting, but part of it is just that I feel I get more power and flexibility--and less opportunity for ambiguity such as this--by doing my own parsing. Your mileage may vary. :)

Regards, Joshua O'Madadhain

On Thu, 12 Sep 2002, John Cwikla wrote:

Actually, to expand a little after some more digging: it appears that the AND/OR terms are being flattened into a list of +, -, or optional query terms that are used to add or remove results one after the other. In this sense, I think the AND, OR and NOT operators were an afterthought to the QueryParser, since they cannot give the same results as +, -, and optional. AND, OR and NOT should use intersections and unions, while + and - do strict additions or rejections. I was expecting a syntax tree, but it looks like just a flattening of terms. cwikla

-----Original Message-----
From: John Cwikla
Sent: Thursday, September 12, 2002 11:53 AM
Subject: RE: GoogleQueryParser

So here is the problem. With AND as the default operator:

    a b OR c == a AND b OR c == c OR (a AND b) == c OR (+a +b) != c b +a != c +b +a

However, with OR as the default operator:

    a AND b c == a AND b OR c == c OR (a AND b) == c OR (+a +b) == c (+a +b)

Since AND and OR do not actually mean "required" or "optional" in a strict Boolean sense, I claim you cannot correctly use the query parser with a default AND operator and get the results that would be expected. I haven't looked more into the QueryParser yet, but in the last case with the AND operator, if at some point the internal query switched to OR, then the last item would be correct if it had parentheses, like c (+a +b). Or is it early and I'm missing something? cwikla

-----Original Message-----
From: Halácsy Péter
Sent: Wednesday, September 11, 2002 11:58 PM
Subject: RE: GoogleQueryParser

From: Philip Chan, Sent: Wednesday, September 11, 2002 11:04 PM: I think there's a bug. If I set the default operator to OR and run "java org.apache.lucene.queryParser.QueryParser a AND b OR c", it gives me the result +a +b c. If I set the default operator to AND and run it with "a b OR c", it gives me +a b c, which is different.

To be exact, it's not a bug, it's a feature. ;) The structured query language of Lucene (and Google and others) is not a strict Boolean language. For example, I think the QueryParser of Lucene does not support parentheses: a AND (b OR c). Instead of strict Boolean logic, it supports constraints on query terms: a query term is either required, optional, or prohibited. If you write a + sign before the term it will be required; if you write -, it will be prohibited. The question is: is a term
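The flattened required/optional reading of "+a +b c" discussed in this thread can be made concrete with a toy scorer: '+' terms are hard requirements, plain terms only add to the score. This mimics the behavior described above, not strict Boolean set algebra.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of Lucene's required/optional clause semantics:
// a document scores 0 if any required term is missing; otherwise its
// score counts the query terms it contains.
public class PlusMinusScorer {

    public static int score(List<String> doc, List<String> required,
                            List<String> optional) {
        for (String t : required) if (!doc.contains(t)) return 0;
        int s = required.size();
        for (String t : optional) if (doc.contains(t)) s++;
        return s;
    }

    public static void main(String[] args) {
        List<String> req = Arrays.asList("a", "b");  // "+a +b"
        List<String> opt = Arrays.asList("c");       // "c"
        System.out.println(score(Arrays.asList("a", "b", "c"), req, opt)); // 3
        System.out.println(score(Arrays.asList("a", "b"), req, opt));      // 2
        System.out.println(score(Arrays.asList("a", "c"), req, opt));      // 0 (missing +b)
    }
}
```

Note how the last case shows why "+a +b c" is not equivalent to any OR-combination involving c: a document with 'a' and 'c' but no 'b' scores zero.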
Re: Newbie quizzes further...
On Mon, 2 Sep 2002, Stone, Timothy wrote:

I have noted that Lucene fails to interpret numerous HTML entities, specifically entities in the 82xx range, e.g. &#8212; (em-dash) and many others. Now this may not be a Lucene issue; I'm looking at the code as I post, but I'm curious as to its origins and why they don't seem to be parsed properly in the index.

As I see it, there are two answers to this question. (1) What gets parsed and indexed is your choice; there are various Analyzers included with the Lucene package, which have different effects. You could conceivably construct an Analyzer that would parse such entities as you describe. (2) Historically, punctuation has not been parsed by search engines, for the simple reason that it doesn't tend to add much to search precision and it complicates the indexing process. (On the other hand, if you're talking about accents and non-English letters, I understand that some people have written analyzers that cover these things; check out the contrib section on the Lucene website.)

Joshua O'Madadhain
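If you do want to resolve numeric character references like &#8212; before (or instead of) indexing them, a small pre-pass decoder is enough; this sketch handles only decimal references, not named entities:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Decode decimal numeric character references (&#8212; -> the em-dash
// character U+2014) before handing the text to an analyzer.
public class NumericEntityDecoder {

    private static final Pattern NUMERIC = Pattern.compile("&#(\\d+);");

    public static String decode(String text) {
        Matcher m = NUMERIC.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            m.appendReplacement(sb,
                    Matcher.quoteReplacement(new String(Character.toChars(codePoint))));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("war&#8212;and peace"));
    }
}
```

Whether the decoded punctuation then survives tokenization is a separate question, per point (2) above.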
Re: text format and scoring
On Sat, 3 Aug 2002, petite_abeille wrote:

I was wondering what would be a good way to incorporate text-format information into Lucene's word/document scoring. For example, when turning HTML into plain text for indexing purposes, a lot of potentially useful information is lost: tags like bold, strong, and so on could be understood as conveying emphasis about some words. If somebody took the pains to underline some words, why throw that away? Assuming there is some interesting meaning in a document's format/layout, and a way to understand and weight it, how could one incorporate this information into document scoring?

If you can boost terms as they are indexed (I can't remember if this is possible, but you can certainly do so on queries), then that might be a good way of doing it; it's not so much a matter of changing document scores (on the back end, with respect to a particular query) as it is of changing the weighting of terms (on the front end). I've just glanced through the API and I don't see a way to do term boosting during indexing, but maybe there's something I've missed. Anyone?

Regards, Joshua O'Madadhain
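One workaround when there is no per-term index-time boost, shown here as a sketch: repeat emphasized words in the text fed to the indexer, which raises their term frequency and hence their score contribution. This is a crude approximation of a boost, not part of the Lucene API; the repeat factor of 3 is arbitrary.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Emphasis weighting by repetition: words found in the emphasized set
// (e.g. collected from <b>/<strong> spans during HTML conversion) are
// duplicated `factor` times in the indexable text.
public class EmphasisWeighter {

    public static String weight(String text, Set<String> emphasized, int factor) {
        StringBuilder out = new StringBuilder();
        for (String word : text.split("\\s+")) {
            int n = emphasized.contains(word.toLowerCase()) ? factor : 1;
            for (int i = 0; i < n; i++) {
                if (out.length() > 0) out.append(' ');
                out.append(word);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> bold = new HashSet<>(Arrays.asList("urgent"));
        System.out.println(weight("this is urgent news", bold, 3));
        // this is urgent urgent urgent news
    }
}
```

The cost is a distorted stored/displayed text, so in practice you would index the weighted text in one field and store the clean text in another.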
Re: Wrong spelling
On Wed, 24 Jul 2002, Olivier Amira wrote: I would like to implement in my Lucene application a google's like feature like the Did you mean google's feature. So, when the user enters a wrong spelling of a word, the search engine automatically propose a similar better word. To implement such function in a Lucene application, I'm not sure of what method is the best (or it's correct to try to di this with a Lucene index). Is there anybody that could help-me for this? There are a couple of different approaches to this that I'm aware of. (1) Find a list of commonly misspelled words, detect them in a query, and prompt the user with the corresponding correctly spelled words. Such lists are pretty common. Advantages: reasonably easy to implement, computationally cheap, and most of the work (figuring out what words to flag and what words to suggest in their place) is done statically. Disadvantages: it will catch 'speling' mistakes but not 'spellling' mistakes (that is, it will only recognize errors that you tell it about). This is entirely independent of the index unless you go to the trouble of removing entries from this auxiliary data structure that correspond to words that aren't in the index anyway. (2) There's something in the Lucene API docs about a FuzzyQuery that mentions Levenshtein distance (= string edit distance, I believe). I haven't looked into this myself, but I would guess that you should be able to construct a FuzzyQuery that specifies a maximum string edit distance between a specified search term and other terms in the index. Unfortunately, the API docs are just about that helpful; FuzzyTermEnum has more information but doesn't tell you how to use FuzzyQuery. On the other hand at least you now know where to look in the source code. :) Advantages: more flexible, seems like it's built in; disadvantages: docs not helpful, will probably slow your query down more than (1) would. 
You could also try to write your own string edit distance calculator/data structure, but I don't have any quick answers as to how to do that. Good luck--

Joshua O'Madadhain
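[Editor's note: the Levenshtein distance that FuzzyQuery is based on is easy to compute directly; a minimal dynamic-programming sketch in plain Java, with no Lucene dependency, purely for illustration:]

```java
public class EditDistance {
    // Classic dynamic-programming Levenshtein distance: the minimum
    // number of single-character insertions, deletions, and
    // substitutions needed to turn string a into string b.
    // Uses two rolling rows, so space is O(|b|) rather than O(|a||b|).
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int subst = prev[j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(subst,
                        Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("speling", "spelling"));   // one insertion
        System.out.println(levenshtein("spellling", "spelling")); // one deletion
    }
}
```

Note that both examples from the message above ('speling' and 'spellling') are distance 1 from 'spelling', which is exactly why an edit-distance approach catches errors a static misspelling list was never told about.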
RE: contains
On Tue, 16 Jul 2002, Lothar Simon wrote:

Just to correct a few points: - The factor would be 2 * (average no of chars per word)/2 = (average no of chars per word).

Actually, I made a mistake in my earlier analysis, but your factor is also inaccurate. Ignoring other factors and overhead, storing just the set of strings S = {s_1, s_2, ..., s_n} with corresponding lengths L = {l_1, l_2, ..., l_n} requires

  sum_i l_i = mean(L) * n

characters [where mean(L) = the above sum divided by the size of L, n]. Storing all prefixes of each string requires

  sum_i sum_j j [where j varies from 1 to l_i] = sum_i (l_i(l_i + 1))/2, or approximately (1/2)(sum_i l_i^2)

characters, so storing both prefixes and suffixes requires approximately

  sum_i (l_i^2)

which is != mean(L) * (sum_i l_i) in general [because you can't take the extra factor of l_i outside of the sum]. The mistake that you and I both made is this: the sum of the squared lengths is not the same as the mean length squared times n. E.g.:

  L = {2, 3, 4, 6, 7, 8} [mean 5]
  sum_i l_i = 30
  sum_i (l_i^2) = 178; 178/30 = 5.93 [correct multiplicative factor]
  n * mean(L)^2 = 150; 150/30 = 5 [this is *wrong*]

and the original hypothesis was that the additional factor should have been the mean, 5. Some additional calculations suggest that if you assume an exponential distribution of word lengths (cf. Zipf's Law), basing your guess on the mean word length will cause you to underestimate by something like 17%. This is all just a fancy way of demonstrating that having long strings hurts you more than having short ones helps you, i.e., using a mean value in place of the sum underestimates in general.

- One would probably create a set of 2 * (maximum number of chars per word) Fields for a document. Whether this could work was actually my question... - Most important: my proposal is exactly (and almost only) designed to solve the substring (*uti*) problem!!!
One field in the first group of fields in my example contains utiful and would be found by uti*; a field in the other group of fields contains itueb and would be found by itu*. Voila!

You are correct; I goofed.

I still think my idea would work (given that you spend the space for the index).

I still don't see how you deal with the problem that I mentioned before: 'Another problem with this is that in order to be able to get from ful to beautiful, you have to store, in the index entry for ful, (pointers to) every single complete word in your document set that contains ful as a prefix or suffix. Just _creating_ such an index would be extremely time-consuming even with clever data structures, and consider how much extra storage for pointers would be necessary for entries like e or n.'

In any case, I personally would consider the expected space overhead to be prohibitive. However, so long as you address the remaining issue I just mentioned, yes, I think that your scheme would work.

Regards, Joshua O'Madadhain
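[Editor's note: the arithmetic in this exchange can be checked in a few lines of plain Java; this is an editorial verification of the example L = {2,3,4,6,7,8}, not code from the thread:]

```java
public class IndexSpace {
    // characters needed to store just the bare strings: sum_i l_i
    public static int sumLengths(int[] lengths) {
        int s = 0;
        for (int l : lengths) s += l;
        return s;
    }

    // approximate characters needed to store all prefixes plus all
    // suffixes: sum_i l_i^2 (the exact count is sum_i l_i*(l_i + 1))
    public static int sumSquares(int[] lengths) {
        int s = 0;
        for (int l : lengths) s += l * l;
        return s;
    }

    public static void main(String[] args) {
        int[] L = {2, 3, 4, 6, 7, 8};           // example lengths, mean 5
        int sum = sumLengths(L);                 // 30
        // correct multiplicative factor: sum of squares over sum
        System.out.println((double) sumSquares(L) / sum);
        // naive mean-based guess: n * mean^2 over sum, i.e. just the mean
        double mean = (double) sum / L.length;
        System.out.println(L.length * mean * mean / sum);
    }
}
```

The first number (~5.93) exceeds the second (5.0): the mean-based estimate really does underestimate, because long words contribute quadratically.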
RE: contains
On Fri, 12 Jul 2002, Lothar Simon wrote: [in response to Peter Carlson pointing out that searching for *xyz* is a difficult problem]

Of course you are right. And I am surely more the last than the first one to try to come up with THE solution for this. But still... Could the following work? If space (OK, a lot of it) is available, you could store beutiful, eutiful, utiful, tiful, iful, ful, ul, l PLUS its inversions (lufitueb, ufitueb, fitueb, itueb, tueb, ueb, eb, b) in the index. Space needed would be something like (average no of chars per word) times as much as in a normal index.

Actually it would be twice that, because you're storing backward and forward versions. I'd hazard a guess that this factor alone would mean something like a 10- or 12-fold increase in index size (the average length of a word is less than 5 or 6 letters, but by throwing out stop words you throw out a lot of the words that drag the average down).

Another problem with this is that in order to be able to get from ful to beautiful, you have to store, in the index entry for ful, (pointers to) every single complete word in your document set that contains ful as a substring. Just _creating_ such an index would be extremely time-consuming even with clever data structures, and consider how much extra storage for pointers would be necessary for entries like e or n.

Finally, you're not including all substrings: your scheme doesn't allow me to search for *uti* and find beautiful. If you did, the number of entries would then be multiplied by a factor of the _square_ of the average number of characters per word. (You might be able to avoid this by doing prefix and suffix searches--which are difficult, but less so--on the strings you specify, though.)

There might be some clever way to get around these problems, but I suspect that developing one would be a dissertation topic.
:) Regards, Joshua O'Madadhain
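[Editor's note: the indexing scheme quoted above — every suffix of the word plus every suffix of its reversal — is easy to sketch in plain Java; class and method names here are illustrative, not anything from Lucene:]

```java
import java.util.ArrayList;
import java.util.List;

public class SuffixIndexer {
    // Every suffix of the word (beutiful, eutiful, ..., l) plus every
    // suffix of its reversal (lufitueb, ufitueb, ..., b), exactly as in
    // the proposal above. A trailing-wildcard search such as uti* over
    // these entries then finds the original word.
    public static List<String> indexEntries(String word) {
        List<String> entries = new ArrayList<String>();
        String reversed = new StringBuilder(word).reverse().toString();
        for (int i = 0; i < word.length(); i++) {
            entries.add(word.substring(i));     // suffixes of the word
            entries.add(reversed.substring(i)); // suffixes of the reversal
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(indexEntries("beutiful"));
    }
}
```

For a word of length 8 this produces 16 entries totalling 8*9 = 72 characters — quadratic in word length, which is precisely the space blowup debated in this thread.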
Re: contains
Pradeep: I think what Peter was trying to get at was this question: when is it useful for a search engine user to be able to search for words that contain a particular letter? For a language like Chinese, it would certainly be useful to be able to search for a single character. However, the informational content of a single letter in an alphabet-based language (such as English) is so low that I have trouble believing that it would be useful to be able to do this kind of search. That is to say: unless this feature has been presented to you as a requirement, you may want to think about how it might be used in practice before you spend a lot of time implementing it.

Regards, Joshua O'Madadhain

On Thu, 11 Jul 2002, Pradeep Kumar K wrote:

Hi Peter, I want to include an option called contains in my search application. For example: Name contains 'p', like that... Thanks for the reply. Pradeep

Peter Carlson wrote:

Do you really want to be able to find items by letter? Or do you have some other purpose that tokenizing by letter is trying to get around? If you do want to tokenize by letter, you can create your own tokenizer which breaks items up by letter. See the current tokenizers under org.apache.lucene.analysis. --Peter

On 7/10/02 10:26 AM, Pradeep Kumar K [EMAIL PROTECTED] wrote:

Is it possible to search for a word that contains some letters? Example: God is love -- how can I create a query to search for sentences having d? I found that Lucene is tokenizing a sentence into words, not into letters. Is it possible using Lucene? Can anybody give a clue for this?
Re: Combining queries using OR
On Sun, 9 Jun 2002, Pradeep Kumar K wrote:

How do I combine queries using the 'OR' operator? For example, two queries are

  Query quer1 = QueryParser.parse(query1, field1, analyzer);
  Query quer2 = QueryParser.parse(query2, field2, analyzer);

How can I combine these two queries using the 'OR' operator, i.e., so that ultimately the query is quer1 OR quer2?

Take a look at BooleanQuery. You want to add quer1 and quer2 to a BooleanQuery such that each is neither prohibited nor required (this will make more sense once you see the interface).

Regards, Joshua O'Madadhain
Re: Normalization of Documents
from Bernhard Messer:

Let me know if you find that idea interesting; I would like to work on that topic.

Yup, me too. This is germane to my research as well.

Joshua
RE: Googlifying lucene querys
You cannot, in general, structure a Lucene query such that it will yield the same document rankings that Google would for that (query, document set). The reason for this is that Google employs a scoring algorithm that includes information about the topology of the pages (i.e., how the pages are linked together). (An overview of what Google does in this regard may be found at http://www.google.com/technology/index.html .) Thus, in order to get Lucene to do what Google does, you'd have to rewrite large chunks of it.

Joshua

On Mon, 25 Feb 2002, Spencer, Dave wrote:

I'm pretty sure Google gives priority to the words appearing in the title and URL. I believe sect 4.2.5 says this here: http://citeseer.nj.nec.com/cache/papers/cs/13017/http:zSzzSzwww-db.stanford.eduzSzpubzSzpaperszSzgoogle.pdf/brin98anatomy.pdf from here: http://citeseer.nj.nec.com/brin98anatomy.html

So you have to have Lucene store the title as a separate field. This is then what you'd have if, like me, you boost the title by a factor of 5 and the URL by a factor of 2 (the caret is the boost):

  +(title:george^5.0 url:george^2.0 contents:george)
  +(title:bush^5.0 url:bush^2.0 contents:bush)
  +(title:white^5.0 url:white^2.0 contents:white)
  +(title:house^5.0 url:house^2.0 contents:house)

-----Original Message----- From: Ian Lea [mailto:[EMAIL PROTECTED]] Sent: Saturday, February 23, 2002 8:15 AM To: Lucene Users List Subject: Re: Googlifying lucene querys

+george +bush +white +house

-- Ian.

Jari Aarniala wrote:

Hello, Despite the confusing subject ;) my question is simple. I'm just trying out Lucene for the first time and would like to know how one would go about implementing search on the index with the same logic that Google uses.
For example, if the user input is george bush white house, how do I easily construct a query that searches for ALL of the words above? If I have understood correctly, passing the search string above to the QueryParser creates a query that searches for ANY of the words above. Thanks for any help,
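[Editor's note: the link-topology scoring referred to above is, at its core, PageRank. A minimal power-iteration sketch in plain Java, added for illustration (not part of Lucene); the 0.85 damping factor is the value used in the original Brin/Page paper, and this simplified version assumes no dangling pages:]

```java
import java.util.Arrays;

public class PageRankSketch {
    // links[i] lists the pages that page i links to. Each iteration
    // redistributes rank along links with damping factor d, plus a
    // uniform (1-d)/n "teleport" share for every page.
    public static double[] pageRank(int[][] links, int n, int iterations) {
        double d = 0.85;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n);
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) continue; // dangling pages ignored here
                double share = d * rank[i] / links[i].length;
                for (int j : links[i]) next[j] += share;
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // 0 -> 1, 1 -> 2, 2 -> {0, 1}: page 1 gets the most link mass
        int[][] links = { {1}, {2}, {0, 1} };
        double[] r = pageRank(links, 3, 50);
        for (double v : r) System.out.println(v);
    }
}
```

The point of the sketch is the one made in the message above: this score depends on the whole link graph, not on any per-document term statistics, which is why no Lucene query can reproduce it.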
Re: Lucene Query Structure
Actually, Winton's suggestion doesn't work because it's inconsistent with the syntax of BooleanQuery (the constructor doesn't take arguments, and add() takes one Query argument, not two). After considerable study of the documentation, I am still confused about the semantics of BooleanQuery. I think I can answer the original syntactic question posed by sjb, but the overall motivation escapes me. I believe the correct syntax is (given TermQueries a, b, c, and d):

  // create a BooleanQuery for '(a AND b)'
  BooleanQuery bq_ab = new BooleanQuery();
  bq_ab.add(a, true, false);
  bq_ab.add(b, true, false);

  // same as above with c and d
  BooleanQuery bq_cd = new BooleanQuery();
  bq_cd.add(c, true, false);
  bq_cd.add(d, true, false);

  // join the two BooleanQueries together
  BooleanQuery bq_abcd = new BooleanQuery();
  bq_abcd.add(bq_ab, false, false);
  bq_abcd.add(bq_cd, false, false);

Now, as sjb pointed out, add(query, false, false) doesn't really seem to have the semantics of a boolean OR. In particular:

(1) It's a unary operator: add() adds a Query (or a BooleanClause) to a BooleanQuery. OR is a binary operator.

(2) add(query, false, false) adds a query Q whose satisfaction is *irrelevant* to that of the resultant BooleanQuery BQ: documents are neither required to satisfy Q, nor required to *not* satisfy Q, in order for BQ to be satisfied. The semantics of a boolean OR should be that at least *one* of the component queries must be satisfied in order for the entire composite query to be satisfied.

I conclude that either (a) I simply don't understand the proper use of BooleanQuery, or (b) BooleanQuery cannot be used to express a boolean OR. If (b) is true, either the semantics of BooleanQuery need revising, or it needs to be called something else so as not to be confusing.
If the semantics of BooleanQuery are revised, I would suggest changing the syntax as well to reflect the binary nature of Boolean operators:

  BooleanQuery.add(Query q1, Query q2, boolean and)

which would equate to (q1 AND q2) if 'and' is true, otherwise (q1 OR q2). If there are any papers or references which would explain this better, that would be great. Otherwise, I would really appreciate it if one of the developers would take a few moments to clarify this issue. I'm trying to use Lucene as a platform for research in IR; to do this I need a clear understanding of the exact definition of the scoring system, especially as it relates to the semantics of queries involving multiple terms. Once I get a clear understanding of this issue, I would be happy to write it up and submit it as an addition to the FAQ/docs. Thanks in advance for any assistance rendered in getting this sorted out.

Regards, Joshua

On Mon, 18 Feb 2002, Winton Davies wrote:

  BQ(Term, Term, Include, Exclude)
  BQ( BQ(a,b,true,false), BQ(c,d,true,false), false, false )

Should work... Winton

Let's say I have two queries which I want to combine into one: (a and b) OR (c and d). I would use QueryParser.parse to form the subqueries, but how do I/can I combine them with the OR logic? BooleanQuery can be used to piece queries together, but it does not support ORs, correct? (It only supports include, exclude, and preferential.) Thanks for any help, sjb
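[Editor's note: one common reading of the (required, prohibited) flags — every required clause must match, no prohibited clause may match, and when there are no required clauses at least one optional clause must match — can be written down as a tiny evaluator in plain Java. This encodes that interpretation for discussion, not a verified statement of Lucene's behavior; under it, two (false, false) clauses do behave as OR, and a lone (false, false) clause does not match every document:]

```java
public class BooleanSemantics {
    // matched[i]: does the document satisfy clause i?
    // required[i] / prohibited[i]: the two flags passed to add().
    public static boolean matches(boolean[] matched, boolean[] required,
                                  boolean[] prohibited) {
        boolean anyRequired = false, anyOptionalHit = false;
        for (int i = 0; i < matched.length; i++) {
            if (prohibited[i]) {
                if (matched[i]) return false;   // NOT: must not match
            } else if (required[i]) {
                anyRequired = true;
                if (!matched[i]) return false;  // AND: must match
            } else if (matched[i]) {
                anyOptionalHit = true;          // OR candidate
            }
        }
        // with no required clauses, at least one optional clause must hit
        return anyRequired || anyOptionalHit;
    }

    public static void main(String[] args) {
        boolean[] none = {false, false};
        // two optional clauses behave as OR:
        System.out.println(matches(new boolean[]{true, false}, none, none));
        System.out.println(matches(new boolean[]{false, false}, none, none));
    }
}
```

The "at least one optional clause must hit" rule is exactly the missing piece that makes (false, false) clauses act as a disjunction rather than being irrelevant.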
Qs re: document scoring and semantics
(1) The FAQ states the following:

For the record, Lucene's scoring algorithm is, roughly:

  score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d

where:

  score_d   : score for document d
  sum_t     : sum over all terms t
  tf_q      : the square root of the frequency of t in the query
  tf_d      : the square root of the frequency of t in d
  idf_t     : log(numDocs/(docFreq_t + 1)) + 1.0
  numDocs   : number of documents in the index
  docFreq_t : number of documents containing t
  norm_q    : sqrt(sum_t((tf_q * idf_t)^2))
  norm_d_t  : square root of the number of tokens in d in the same field as t
  boost_t   : the user-specified boost for term t
  coord_q_d : (number of terms in both query and document) / (number of terms in query)

Is either of the expressions below the correct parenthesization of the expression above? If not, what is?

  score_d = sum_t(tf_q * (idf_t / norm_q) * tf_d * (idf_t / norm_d_t) * boost_t) * coord_q_d
  score_d = sum_t((tf_q * idf_t) / (norm_q * tf_d * idf_t) / (norm_d_t * boost_t)) * coord_q_d

(2) I'm trying to make sure that I have a handle on the semantics of BooleanQuery, especially as they relate to the scoring mechanism. I would appreciate it if someone would correct any misapprehensions in the following descriptions.

* BooleanQuery.add(query, false, false) is equivalent to Boolean OR. Only one such query must be satisfied in a given document D (via the appearance of the associated term(s)/phrase(s) in D) in order for the score for D to be nonzero.
* BooleanQuery.add(query, true, false) is equivalent to Boolean AND. All such queries must be satisfied (as above) in D in order for score(D) to be nonzero.
* BooleanQuery.add(query, false, true) is equivalent to Boolean NAND. All such queries must *not* be satisfied in D in order for score(D) to be nonzero.

If these, and the semantics of required and prohibited (in BooleanQuery.add()), are accurate, then the semantics seem rather odd to me, so I'm hoping that someone will tell me that I'm wrong.
:) In particular, it seems to me that if you create a BooleanQuery and add a single TermQuery tq to it with add(tq, false, false), then, according to the semantics of required and prohibited, *any* document will match the query... which clearly doesn't make sense.

(3) A somewhat unrelated question: what are the semantics and purpose of FilteredTermEnum.difference()? (I see where and how it's used in the source, but I don't understand the motivation.)

(4) I'm still somewhat puzzled by MultiTermQuery. I believe that Lucene essentially uses the vector model to identify (and quantify) query-document matching. A vector-model query consists of one or more terms, so I would expect MultiTermQuery to be a common query type. However, the documentation doesn't make it clear how one binds several terms together into a MultiTermQuery, and in particular I don't see how one could separately set the boost for different terms in a MultiTermQuery. In my current prototype system, I've been using BooleanQuery to bind my terms together into a single query, and I'm not sure that the semantics of BooleanQuery are what I want (hence question (2) above). Could someone please explain what MultiTermQuery is for, how it should be used, etc.?

Thanks--

Joshua O'Madadhain
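[Editor's note: the first parenthesization proposed in question (1) transcribes directly into code. This sketch is an editorial illustration with made-up inputs and no Lucene dependency; it exists purely to pin down the grouping of the formula, not to assert which reading is correct:]

```java
public class LuceneScoreSketch {
    // score_d = sum_t(tf_q * (idf_t / norm_q) * tf_d
    //                 * (idf_t / norm_d_t) * boost_t) * coord_q_d
    // using the FAQ's definitions; all per-term arrays are parallel
    // (one entry per query term t).
    public static double score(double[] tfQ, double[] tfD, double[] idf,
                               double[] normDt, double[] boost,
                               double coord) {
        // norm_q = sqrt(sum_t((tf_q * idf_t)^2))
        double normQ = 0;
        for (int t = 0; t < tfQ.length; t++) {
            double w = tfQ[t] * idf[t];
            normQ += w * w;
        }
        normQ = Math.sqrt(normQ);

        double sum = 0;
        for (int t = 0; t < tfQ.length; t++) {
            sum += tfQ[t] * (idf[t] / normQ)
                 * tfD[t] * (idf[t] / normDt[t]) * boost[t];
        }
        return sum * coord;
    }

    public static void main(String[] args) {
        // one term: tf_q = tf_d = 1, idf = 2, norm_d_t = 1, boost = 1, coord = 1
        System.out.println(score(new double[]{1}, new double[]{1},
                                 new double[]{2}, new double[]{1},
                                 new double[]{1}, 1.0));
    }
}
```

With a single term, norm_q collapses to tf_q * idf_t, so the query-side factor normalizes to 1 and the score reduces to tf_d * (idf_t / norm_d_t) * boost_t, which is a quick sanity check on the grouping.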
number of terms vs. number of fields
I have been experimenting with indexing a document set with different sets of fields. Specifically, I start out with a contents field that is a concatenation of all the elements of the original document in which I'm interested. This gets me an index with about 7500 unique terms (which I determine by opening up an IndexReader, extracting the terms in the index, and counting them). Then I've been adding each of the separate elements (title, major subject, minor subject, abstract/extract), one at a time, to the index (by recreating the index).

Because contents is the concatenation of the other fields (title, major, minor, abstract/extract), I would expect that the number of unique terms in the index would not change if I added the other fields into the index; each term should just have twice the frequency it would if I only used the contents field. However, this is not what's happening; in fact, if I add all the other fields in, the total number of unique terms is 22000+. I have verified that contents contains everything that the other fields do, so I am quite puzzled by this. Any idea what's going on here, and why?

For anyone who might be interested in checking this out, my code is below. Regards, Joshua

// FileCFDocument.java (a modified version of FileDocument.java in the
// source examples)

import java.io.File;
import java.io.FileInputStream;
import java.util.Vector;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/** A utility for making Lucene Documents from a File.
 */
public class FileCFDocument {

  public static Document[] makeDocuments(File f)
      throws java.io.FileNotFoundException, java.io.IOException {
    // open file, read it into a byte array and thence a String
    FileInputStream fis = new FileInputStream(f);
    int n = fis.available();
    byte[] data = new byte[n];
    fis.read(data);
    fis.close();
    String s = new String(data);

    int ti, so, mj, mn, ab, ex, rf, abex;
    Vector vdocs = new Vector();
    String contents;

    // fields being indexed:
    //   TI (title)
    //   MJ (major subject)
    //   MN (minor subject)
    //   AB/EX (abstract/extract)
    ti = s.indexOf("\nTI ");
    while (ti != -1) {
      // make a new, empty document
      Document doc = new Document();
      int k = s.indexOf("\nPN ");
      doc.add(Field.UnIndexed("number", s.substring(k+4, k+9)));
      // DEBUG
      System.out.println(s.substring(k, k+9));
      s = s.substring(ti+4);
      // DEBUG
      System.out.println("s.length(): " + s.length());
      so = s.indexOf("\nSO ");
      mj = s.indexOf("\nMJ ");
      mn = s.indexOf("\nMN ");
      ab = s.indexOf("\nAB ");
      ex = s.indexOf("\nEX ");
      rf = s.indexOf("\nRF ");
      //System.out.println("so: " + so + ", mj: " + mj + ", mn: " + mn +
      //                   ", ab: " + ab + ", ex: " + ex + ", rf: " + rf);
      String title = s.substring(0, so);
      doc.add(Field.Text("title", title));
      contents = title;
      if (mj != -1 && mj < mn) // not all documents have major subject
      {
        String major = s.substring(mj+4, mn);
        //doc.add(Field.Text("major", major));
        contents = contents + " " + major;
      }
      if (ab != -1 && ab < rf) // if this document has an abstract
      {
        abex = ab;
        String abs = s.substring(ab+4, rf);
        //doc.add(Field.Text("abstract", abs));
        contents = contents + " " + abs;
      }
      else // it has an extract instead
      {
        abex = ex;
        String extract = s.substring(ex+4, rf);
        //doc.add(Field.Text("extract", extract));
        contents = contents + " " + extract;
      }
      if (mn != -1 && mn < abex)
      {
        String minor = s.substring(mn+4, abex);
        //doc.add(Field.Text("minor", minor));
        contents = contents + " " + minor;
      }
      // add a field that's the concatenation of the others so that
      // we can search on all fields simultaneously
      doc.add(Field.Text("contents", contents));
      // DEBUG
      //System.out.println(contents);
      System.out.println(doc.toString());
      ti = s.indexOf("\nTI ");
      vdocs.add(doc);
    }
    Document[] docs = new Document[vdocs.size()];
    vdocs.toArray(docs);
    return docs;
  }

  private FileCFDocument() {}
}

// IndexCFFiles.java (a modification of the IndexFiles.java example)

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.document.Document;
import java.io.File;
import java.util.Date;
// DEBUG import