Re: Alternate Boolean Query Parser?
On Friday 28 March 2003 15:48, Shah, Vineel wrote:
> One of my clients is asking for an old-style boolean query search on my
> keywords fields. A string might look like this:
>
> "oracle admin*" and java and oracle and ("8.1.6" or "8.1.7") and
> ("solaris" or "unix" or "linux")
>
> There would probably be a need for nested parentheses, although I can't
> think of an example. Is there a parser I can plug into Lucene to make this
> happen? It doesn't seem like the normal QueryParser class would like this
> string, or would it? Any ideas or comments would be appreciated.

Actually, I think it should, as long as you change 'and' to 'AND' and 'or' to 'OR' (the upper-case versions are used, I think, to make it less likely that the user meant to match the literal words 'and' and 'or').

> Making my own grammar and parser class is too expensive a proposition.

Well, writing a simple grammar and parser is fairly easy to do if you've ever used java_cup or javacc (or just (b)yacc / bison); it shouldn't take all that long, since all the actual query classes already exist. But I don't think you need to do even that. :-)

The only feature that might need some additional work is matching "oracle admin*"; PhrasePrefixQuery allows doing something like that, but it's not integrated with QueryParser (I think it probably should be, and that might be quite easy to do).

-+ Tatu +-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
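Tatu's suggestion — just upper-case the operators before handing the string to QueryParser — can be sketched as a small pre-processing step. This is an illustrative helper (the class and method names are made up, not part of Lucene); it avoids touching words inside quoted phrases by splitting on quote characters first.

```java
// Hypothetical pre-processor: upper-case the boolean operators and/or/not
// outside quoted phrases, so the stock QueryParser treats them as operators
// rather than as terms to match.
public class BooleanOpUppercaser {
    public static String uppercaseOps(String query) {
        StringBuilder out = new StringBuilder();
        // split on '"'; even-indexed chunks are outside quoted phrases
        String[] parts = query.split("\"", -1);
        for (int i = 0; i < parts.length; i++) {
            String p = parts[i];
            if (i % 2 == 0) {
                p = p.replaceAll("\\band\\b", "AND")
                     .replaceAll("\\bor\\b", "OR")
                     .replaceAll("\\bnot\\b", "NOT");
            }
            out.append(p);
            if (i < parts.length - 1) out.append('"');
        }
        return out.toString();
    }
}
```

On the example from the original question this turns `"oracle admin*" and java and ("8.1.6" or "8.1.7")` into `"oracle admin*" AND java AND ("8.1.6" OR "8.1.7")`, which the standard QueryParser should accept.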
Re: Wildcard searching - Case sensitive?
On Friday 28 March 2003 08:37, [EMAIL PROTECTED] wrote:
> Ok, thanks Otis,
>
> you have to write the terms lowercase when you're searching with wildcards.

Or use the set method in QueryParser to ask it to automatically lower-case those terms. A patch for that was added before 1.3RC1 (check the javadocs or source for the exact method to call). I think the default was not to enable this feature, for backwards compatibility (unless Otis changed it, as was suggested?).

-+ Tatu +-
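If you'd rather not depend on that parser option, the same effect can be had by lower-casing just the wildcard terms yourself before parsing. A minimal sketch (the class and method names are illustrative, not a Lucene API):

```java
// Hypothetical pre-processor: lower-case only the terms that contain a
// wildcard ('*' or '?'), leaving the rest of the query string untouched,
// so wildcard terms match the lower-cased terms StandardAnalyzer indexed.
public class WildcardLowercaser {
    public static String lowercaseWildcards(String query) {
        StringBuilder out = new StringBuilder();
        for (String token : query.split(" ")) {
            if (token.indexOf('*') >= 0 || token.indexOf('?') >= 0) {
                token = token.toLowerCase();
            }
            if (out.length() > 0) out.append(' ');
            out.append(token);
        }
        return out.toString();
    }
}
```

Note this naive split-on-spaces version also lower-cases any field prefix attached to the wildcard term, which is usually harmless but worth knowing.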
Alternate Boolean Query Parser?
One of my clients is asking for an old-style boolean query search on my keywords fields. A string might look like this:

"oracle admin*" and java and oracle and ("8.1.6" or "8.1.7") and ("solaris" or "unix" or "linux")

There would probably be a need for nested parentheses, although I can't think of an example. Is there a parser I can plug into Lucene to make this happen? It doesn't seem like the normal QueryParser class would like this string, or would it? Any ideas or comments would be appreciated. Making my own grammar and parser class is too expensive a proposition.

Vineel Shah
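For reference, writing a parser for exactly this lower-case boolean grammar is smaller than it may sound. The sketch below is a toy recursive-descent parser (names invented for illustration) that emits an s-expression; a real implementation would build Lucene BooleanQuery objects at the same points instead.

```java
// Toy recursive-descent parser for:  orExpr := andExpr ('or' andExpr)*
//                                    andExpr := atom ('and' atom)*
//                                    atom := WORD | "PHRASE" | '(' orExpr ')'
// 'or' binds loosest, matching the usual boolean-query convention.
import java.util.*;

public class MiniBooleanParser {
    private final List<String> toks = new ArrayList<>();
    private int pos = 0;

    public MiniBooleanParser(String q) {
        // quoted phrases, parentheses, and bare words become single tokens
        java.util.regex.Matcher m = java.util.regex.Pattern
            .compile("\"[^\"]*\"|\\(|\\)|[^\\s()\"]+").matcher(q);
        while (m.find()) toks.add(m.group());
    }

    private String peek() { return pos < toks.size() ? toks.get(pos) : null; }
    private String next() { return toks.get(pos++); }

    public String parse() { return orExpr(); }

    private String orExpr() {
        String left = andExpr();
        while ("or".equals(peek())) { next(); left = "(OR " + left + " " + andExpr() + ")"; }
        return left;
    }

    private String andExpr() {
        String left = atom();
        while ("and".equals(peek())) { next(); left = "(AND " + left + " " + atom() + ")"; }
        return left;
    }

    private String atom() {
        if ("(".equals(peek())) { next(); String e = orExpr(); next(); /* ')' */ return e; }
        return next(); // a word or a quoted phrase
    }
}
```

Parsing `java and ("8.1.6" or "8.1.7")` yields `(AND java (OR "8.1.6" "8.1.7"))`, showing the nesting comes out right without any generated-parser machinery.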
Re: I: incremental index
I believe it takes constant time to add a new document to an index, because adding a new document creates a new segment on disk, separate from the other, existing index segments. The size of the index may come into play when this new segment has to be merged with the existing segments, which happens every mergeFactor documents, so to speak. I have built indices with several hundred thousand documents, but never noticed an increase in the time to add a new document to an index. Maybe the difference was too small to notice. I don't have sufficient knowledge of Lucene to be able to stand behind this 100%, and I could certainly be wrong :(.

Otis

--- Leo Galambos <[EMAIL PROTECTED]> wrote:
> > Adding a new document does not immediately modify an index, so the time
> > it takes to add a new document to an existing index is not proportional
> > to the index size. It is constant. The execution time of optimize()
> > is proportional to the index size, so you want to do that only if you
> > really need it. The Lucene article on http://www.onjava.com/ from
> > March 5th describes this in more detail.
>
> Otis,
>
> I am not sure if anything about constants is constant in non-constant IR
> systems :-)
>
> I think that the correct answer is O(t/k * (1 + log_m(k))), where t is the
> time you need to create and write one monolithic segment of k documents,
> m is the merge factor you use, and k is the number of documents which are
> already in the index. As you can see, the function grows with k.
>
> Can you explain to me why the addition of one document takes constant time?
>
> Thank you
>
> -g-
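The two positions can be reconciled with a toy model of the merge policy described in this thread: each add flushes a one-document segment, and whenever mergeFactor equal-sized segments accumulate they are merged into one (cascading upward). This is an assumption-laden sketch, not Lucene's actual code, but it lets one count the documents copied during merges:

```java
// Toy merge-cost model: count how many documents get copied during merges
// while adding numDocs documents one at a time, under a policy that merges
// every run of mergeFactor equal-sized segments into a single segment.
import java.util.*;

public class MergeCostModel {
    public static long totalMergeWork(int numDocs, int mergeFactor) {
        long work = 0;
        List<Integer> segments = new ArrayList<>(); // segment sizes, in docs
        for (int d = 1; d <= numDocs; d++) {
            segments.add(1);                        // flush a 1-doc segment
            boolean merged = true;
            while (merged) {                        // cascade merges upward
                merged = false;
                int n = segments.size();
                if (n >= mergeFactor) {
                    int size = segments.get(n - 1);
                    boolean allEqual = true;
                    for (int i = n - mergeFactor; i < n; i++)
                        if (!segments.get(i).equals(size)) { allEqual = false; break; }
                    if (allEqual) {
                        int total = 0;
                        for (int i = 0; i < mergeFactor; i++)
                            total += segments.remove(segments.size() - 1);
                        segments.add(total);
                        work += total;              // every doc merged is copied once
                        merged = true;
                    }
                }
            }
        }
        return work;
    }
}
```

For 10,000 documents and mergeFactor 10, the model copies each document once per merge level, i.e. about log_10(10,000) = 4 times (40,000 copies total). That matches Leo's O(log_m(k)) amortized term: most individual adds are cheap, which is why Otis never noticed a slowdown, but the amortized cost per add does grow slowly with index size.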
Re: I: incremental index
> Adding a new document does not immediately modify an index, so the time
> it takes to add a new document to an existing index is not proportional
> to the index size. It is constant. The execution time of optimize()
> is proportional to the index size, so you want to do that only if you
> really need it. The Lucene article on http://www.onjava.com/ from
> March 5th describes this in more detail.

Otis,

I am not sure if anything about constants is constant in non-constant IR systems :-)

I think that the correct answer is O(t/k * (1 + log_m(k))), where t is the time you need to create and write one monolithic segment of k documents, m is the merge factor you use, and k is the number of documents which are already in the index. As you can see, the function grows with k.

Can you explain to me why the addition of one document takes constant time?

Thank you

-g-
RE: Tokenize negative number
You are using an Analyzer that throws out non-alphanumeric characters, most likely StandardAnalyzer. You can create your own Analyzer to do exactly what you want. A sample Analyzer is in the Lucene FAQ at http://jguru.com/ .

Otis

--- Lixin Meng <[EMAIL PROTECTED]> wrote:
> Browsing through some of the previous discussion, but I have to say that I
> couldn't find a solution for this. Would you mind providing more clues on
> this?
>
> Regards,
> Lixin
>
> -----Original Message-----
> From: Terry Steichen [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 25, 2003 7:14 PM
> To: Lucene Users List; [EMAIL PROTECTED]
> Subject: Re: Tokenize negative number
>
> Probably tokenized 1234 as a string and treated '-' as a separator. See
> the previous discussion on "query".
>
> Regards,
>
> Terry
>
> ----- Original Message -----
> From: "Lixin Meng" <[EMAIL PROTECTED]>
> To: "'Lucene Users List'" <[EMAIL PROTECTED]>
> Sent: Tuesday, March 25, 2003 9:16 PM
> Subject: Tokenize negative number
>
> > I have a document with content ' -1234 '. However, after calling the
> > StandardTokenizer, the token only has '1234' (missing the '-') as its
> > termText.
> >
> > Did anyone experience a similar problem, and is there a workaround?
> >
> > Regards,
> > Lixin
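The core of such a custom analyzer is just a different tokenizing rule: keep a '-' when it starts a number, treat it as a separator otherwise. A plain-Java sketch of that rule (a real Lucene Analyzer would wrap this in a Tokenizer subclass; the class name here is invented):

```java
// Sketch of a tokenizing rule that preserves negative numbers: a '-' is
// kept only when it immediately precedes a digit run; elsewhere it acts
// as a separator, like StandardTokenizer treats it.
import java.util.*;

public class NegativeNumberTokenizer {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // first alternative wins, so "-1234" is taken as one token
        java.util.regex.Matcher m =
            java.util.regex.Pattern.compile("-?\\d+|\\w+").matcher(text);
        while (m.find()) tokens.add(m.group());
        return tokens;
    }
}
```

This is deliberately simplistic (e.g. it would also keep the '-' in `x-1234`); the point is only that the decision lives in the tokenizer, which is exactly the piece you replace when you write your own Analyzer.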
RE: Tokenize negative number
Browsing through some of the previous discussion, but I have to say that I couldn't find a solution for this. Would you mind providing more clues on this?

Regards,
Lixin

-----Original Message-----
From: Terry Steichen [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 25, 2003 7:14 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: Tokenize negative number

Probably tokenized 1234 as a string and treated '-' as a separator. See the previous discussion on "query".

Regards,

Terry

----- Original Message -----
From: "Lixin Meng" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Tuesday, March 25, 2003 9:16 PM
Subject: Tokenize negative number

> I have a document with content ' -1234 '. However, after calling the
> StandardTokenizer, the token only has '1234' (missing the '-') as its
> termText.
>
> Did anyone experience a similar problem, and is there a workaround?
>
> Regards,
> Lixin
Re: I: incremental index
Adding a new document does not immediately modify an index, so the time it takes to add a new document to an existing index is not proportional to the index size. It is constant. The execution time of optimize() is proportional to the index size, so you want to do that only if you really need it. The Lucene article on http://www.onjava.com/ from March 5th describes this in more detail.

Otis

--- "Rende Francesco, CS" <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm a Lucene user and I have found it very interesting software.
>
> My question is related to how to manage incremental updates of the Lucene
> index. In particular, is adding more documents to a big index (~10 GB) the
> same as creating a new segment and then merging the indexes? Does adding
> documents to an existing, big index make the retrieval process slow?
>
> Which is the best solution for index performance:
>
> a) always create a new index
>
> b) create the index and then add more documents (with a final optimize)
>
> c) create the index, then create a small segment for the new docs, and
> then merge the indexes.
>
> Thanks in advance
>
> F. Rende
Re: Wildcard searching - Case sensitive?
Ok, thanks Otis,

you have to write the terms lowercase when you're searching with wildcards.

Otis Gospodnetic wrote on 27.03.03 19:20:

> See FAQ.
>
> --- [EMAIL PROTECTED] wrote:
> > Hi all,
> >
> > There is something I don't understand about wildcard queries.
> > I have values like 'REGENERATION GAS DISTRIBUTION' in the table.
> > When I make a query like descr:Gas I receive 31 hits, and the same with
> > the query descr:gas. But when I search for GAS* I receive nothing,
> > while for gas* I receive the same hits as for gas or GAS.
> > I use StandardAnalyzer for both indexing and searching. I thought this
> > analyzer makes all terms lowercase? What do I have to do to get the
> > same hits for upper- and lower-case characters in the query?
> >
> > Thx for help
> >
> > Arsineh
I: incremental index
Hi,

> I'm a Lucene user and I have found it very interesting software.
>
> My question is related to how to manage incremental updates of the Lucene
> index. In particular, is adding more documents to a big index (~10 GB) the
> same as creating a new segment and then merging the indexes? Does adding
> documents to an existing, big index make the retrieval process slow?
>
> Which is the best solution for index performance:
>
> a) always create a new index
>
> b) create the index and then add more documents (with a final optimize)
>
> c) create the index, then create a small segment for the new docs, and
> then merge the indexes.
>
> Thanks in advance
>
> F. Rende
RE: Very large queries?
How about this: assuming that your taxonomies are tree-like structures, you could expand every term in the documents to be indexed with the path of where it belongs in the tree (i.e., all hypernyms and hyponyms); for this you use the same technique as when using thesauri. This will allow you to enter in the query only one node of the taxonomy, from any level, and get back all the records/documents that contain it...

--
Alex Murzaku
alex(at)lissus.com http://www.lissus.com

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Friday, March 28, 2003 6:49 AM
To: Lucene Users List
Subject: Re: Very large queries?

Thanks for these suggestions. The idea of adding taxonomy-related terms to the documents is an interesting one and bears some thought. However, if I have to pre-process the corpus to determine which terms to add, and then to add them, it would seem that I've already accomplished my primary goal and don't need an indexer and search engine. Remember: this is not really an information retrieval application (with document-level granularity) that is being contemplated here, but an information extraction and text/data mining application (with "fact-level" granularity). My hope was to leverage a search engine, guided by taxonomies, to accomplish this, at least as a first cut.

I do find Morus's suggestion to do an "inverse expansion" of terms in the index at indexing time very intriguing as well. Perhaps it is also what was meant by Ype's suggestion about adding stuff to the document (meaning adding stuff to the index).

It appears I will also need to handle my own identification of matched terms. Verity, too, supports term highlighting, but I am not at all certain they return information concerning the exact string that triggered the highlighted match. Perhaps if the "inverse expansion" approach can be made to work, it would eliminate this need. And it might also eliminate the need for the very large queries.

The details are unclear at this point, but the possibilities are interesting. The suggestion of Jython is also appreciated, and I was considering it already. I have not used Jython yet, but have developed all of my ontology/taxonomy/dictionary/thesaurus translation tools in Python (and yes, I do know the differences among all of these). I've even started to develop some of my interface stuff in Tkinter, but if I'm going to go the Java route I'll probably abandon that in favor of Swing.

Well, I can see that I have a bit of work to do. I do have an undergraduate and a graduate student here at NC State working with me, and perhaps I can squeeze some of this work out of them :-).

--
Gary H. Merrill
Director and Principal Scientist, New Applications
Data Exploration Sciences
GlaxoSmithKline Inc.
(919) 483-8456
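Alex's index-time expansion idea can be sketched in a few lines: each term in a document is indexed together with all of its ancestors in the taxonomy, so a query for any node of the tree matches documents that mention its leaves. The taxonomy edges and drug names below are purely illustrative, not a real vocabulary.

```java
// Sketch of index-time taxonomy expansion: for each token, emit the token
// plus every ancestor (hypernym) on its path to the taxonomy root. These
// extra terms would be added to the document's indexed field.
import java.util.*;

public class TaxonomyExpander {
    private final Map<String, String> parent = new HashMap<>(); // child -> parent

    public void addEdge(String child, String par) { parent.put(child, par); }

    /** Terms to index for one token: the token itself plus every ancestor. */
    public List<String> expand(String token) {
        List<String> out = new ArrayList<>();
        for (String t = token; t != null; t = parent.get(t)) out.add(t);
        return out;
    }
}
```

With edges `atenolol -> beta_blocker -> drug`, a document mentioning "atenolol" also gets "beta_blocker" and "drug" indexed, so a single-term query on any level of the taxonomy retrieves it; the very large disjunctive queries become unnecessary, at the cost of a larger index.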
Re: Very large queries?
Thanks for these suggestions. The idea of adding taxonomy-related terms to the documents is an interesting one and bears some thought. However, if I have to pre-process the corpus to determine which terms to add, and then to add them, it would seem that I've already accomplished my primary goal and don't need an indexer and search engine. Remember: this is not really an information retrieval application (with document-level granularity) that is being contemplated here, but an information extraction and text/data mining application (with "fact-level" granularity). My hope was to leverage a search engine, guided by taxonomies, to accomplish this, at least as a first cut.

I do find Morus's suggestion to do an "inverse expansion" of terms in the index at indexing time very intriguing as well. Perhaps it is also what was meant by Ype's suggestion about adding stuff to the document (meaning adding stuff to the index).

It appears I will also need to handle my own identification of matched terms. Verity, too, supports term highlighting, but I am not at all certain they return information concerning the exact string that triggered the highlighted match. Perhaps if the "inverse expansion" approach can be made to work, it would eliminate this need. And it might also eliminate the need for the very large queries.

The details are unclear at this point, but the possibilities are interesting. The suggestion of Jython is also appreciated, and I was considering it already. I have not used Jython yet, but have developed all of my ontology/taxonomy/dictionary/thesaurus translation tools in Python (and yes, I do know the differences among all of these). I've even started to develop some of my interface stuff in Tkinter, but if I'm going to go the Java route I'll probably abandon that in favor of Swing.

Well, I can see that I have a bit of work to do. I do have an undergraduate and a graduate student here at NC State working with me, and perhaps I can squeeze some of this work out of them :-).

--
Gary H. Merrill
Director and Principal Scientist, New Applications
Data Exploration Sciences
GlaxoSmithKline Inc.
(919) 483-8456
Re: Very large queries?
Gary,

On Thursday 27 March 2003 15:34, [EMAIL PROTECTED] wrote:
> Let me describe what the goal is and how I could use Verity to accomplish
> this -- provided that Verity did not impose the limits it does.
>
> The documents being indexed are small, completely unstructured, textual
> descriptions of adverse events involving one or more drugs, one or more
> medical conditions, and potentially other relevant and irrelevant
> information. By "small" I mean that they are typically on the order of
> several sentences in length.
>
> Assume that the initial goal is to identify pairwise associations of drugs
> and conditions in such documents. Moreover, we would want not only to
> identify drug/condition pairs, but more broadly to identify
> type-of-drug/type-of-condition pairs. For example, the set of adverse
> event reports might not contain a significant number of reports about a
> *specific* drug and a *specific* condition -- such as (just as an example)
> 'aspirin' and 'blood pressure' -- but it might contain a significant number
> of reports about a particular *class* of drugs (therapeutic class or
> pharmacological class) and a particular *class* of conditions -- such as
> 'beta-blockers' and 'cardiac events'.
>
> Viewed as an information retrieval problem (not the best way to view it,
> but this is just an initial approach), one could then (1) create a taxonomy
> of drugs and a taxonomy of conditions, and (2) implement a concept-oriented
> (taxonomy-oriented) search of the corpus for something like:
>
> {beta_blocker} AND {cardiac_condition}
>
> where '{beta_blocker}' expands via the taxonomy to the set of terms (words,
> sequences of words, etc.) that "fall under" that "concept" in the drug
> taxonomy, and similarly for '{cardiac_condition}' under the condition
> taxonomy.
>
> A good search engine would then return (for any document in which the query
> is matched) the exact string(s) matching the query (e.g., 'thrombosis' or
> 'infarction' in the case of '{cardiac_condition}'). That is, from a very
> general query (phrased in terms of 'concepts' or 'categories'), you would
> get returned associations of specific terms and phrases. Actually, Verity
> does this pretty nicely once you transform your taxonomy into a Verity
> topic set.

Lucene has no direct support for returning the actually matched terms. The closest thing is the term highlighting that is being worked on in the Lucene contributions. IIRC, they do it by retrieving each hit and string-searching for the query terms in the original text. In case you need only the actually matched terms in a search, you'll probably need to extend the search engine to visit the actually matched terms.

> You can then take these specific associations you have identified
> (retrieved) and see what generalizations they fall under from the point of
> view of the taxonomy -- hoping to identify associations between classes of
> drugs and classes of conditions. (How you do this, I omit here.)
>
> Ideally, your initial search should then simply be the most general one
> possible -- say of the form:
>
> {drug} AND {condition}
>
> (actually, probably not quite this simple; but you get the idea). The
> problem is that '{drug}' will expand into (logically) a disjunctive term of
> 60,000 subterms, and '{condition}' will likewise expand into a disjunctive
> term of multiple thousands of terms. Something logically equivalent to:
>
> (drug_1 OR drug_2 ... OR drug_6) AND (condition_1 OR
> condition_2 OR ... condition_5000)
>
> Verity's implementation of their query constructor (it generates a machine
> to do the matching) imposes a limit of 1,024 child nodes for any single
> disjunctive node (roughly speaking) and a collective limit of (16,000/3)
> nodes for a topic. Prior to hitting the limit, Verity does just swell.

AFAIK, Lucene has no built-in limitations on query size. I have used queries of several hundred terms (no AND operator) on an intermediate-size collection of documents (about 1 GB index size including stored text, about 100,000 docs, i.e. 10 kB/doc including index size) and Lucene just did it, although quite a bit slower than for smaller queries. The algorithms used in Lucene are geared towards smaller queries, however.

> So, with that much more context, the question can now be rephrased as to
> whether there is any problem with Lucene handling queries such as the one
> above, where there are disjunctive sub-queries with that many terms.
>
> You can see, I think, that this has nothing to do with categorization (at
> least in the usual sense). It is, in fact, an attempt to use a search
> engine to accomplish information extraction. I was hoping to do this in
> order to get some quick and relatively easy results -- and I could if
> Verity didn't have these scaleability problems. The one suggestion I have
> seen so far in the responses that seems relevant to the problem is the
> suggestion to transform the tax
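The query-side half of Gary's scheme — expanding `{concept}` placeholders into large disjunctions before parsing — is mechanically simple. A sketch (class name, concept names, and term lists invented for illustration; a real taxonomy would supply thousands of terms, and multi-word terms would additionally need quoting):

```java
// Sketch: expand {concept} placeholders into parenthesized OR-disjunctions
// of the concept's taxonomy terms, producing a string a query parser with
// no size limits could then consume.
import java.util.*;

public class ConceptQueryExpander {
    public static String expand(String query, Map<String, List<String>> taxonomy) {
        for (Map.Entry<String, List<String>> e : taxonomy.entrySet()) {
            String disjunction = "(" + String.join(" OR ", e.getValue()) + ")";
            query = query.replace("{" + e.getKey() + "}", disjunction);
        }
        return query;
    }
}
```

So `{beta_blocker} AND {cardiac_condition}` becomes something logically equivalent to `(term_1 OR term_2 OR ...) AND (thrombosis OR infarction OR ...)`; whether Lucene handles the resulting 60,000-term disjunctions efficiently is exactly the open question of this thread.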