Re: Alternate Boolean Query Parser?

2003-03-28 Thread Tatu Saloranta
On Friday 28 March 2003 15:48, Shah, Vineel wrote:
> One of my clients is asking for an old-style boolean query search on my
> keywords fields. A string might look like this:
>
>   "oracle admin*" and java and oracle and ("8.1.6" or "8.1.7") and
> ("solaris" or "unix" or "linux")
>
> There would probably be a need for nested parentheses, although I can't think
> of an example. Is there a parser I can plug into lucene to make this
> happen? It doesn't seem like the normal QueryParser class would like this
> string, or would it? Any ideas or comments would be appreciated. Making my

Actually I think it should, as long as you change 'and' to 'AND' and 'or' to 
'OR' (the upper-case versions are used, I think, to make it less likely that 
the user meant to match the words 'and' and 'or').
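
For illustration, a minimal sketch of that (assuming the Lucene 1.3-era static 
QueryParser.parse() and a field called "keywords", which is just an example 
name; the wildcard inside the quoted phrase is left out here, see the note on 
PhrasePrefixQuery below):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class BooleanSyntaxDemo {
    public static void main(String[] args) throws Exception {
        // Same query, with the operators upper-cased so QueryParser treats
        // them as boolean operators rather than as search terms.
        String q = "\"oracle admin\" AND java AND oracle"
                 + " AND (\"8.1.6\" OR \"8.1.7\")"
                 + " AND (solaris OR unix OR linux)";
        Query parsed = QueryParser.parse(q, "keywords", new StandardAnalyzer());
        System.out.println(parsed.toString("keywords"));
    }
}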

> own grammar and parser class is too expensive a proposition.

Well, writing a simple grammar and parser is fairly easy to do if you've ever 
used java_cup or javacc (or just (b)yacc / bison); it shouldn't take all that 
long since all the actual query classes already exist. But I don't think you 
need to do even that. :-)

The only feature that might need some additional work is matching "oracle 
admin*"; PhrasePrefixQuery allows doing something like that, but it's not 
integrated with QueryParser (I think it probably should be, and that might be 
quite easy to do).
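
As a rough sketch of what such an integration would have to do under the hood, 
here is one way to build the "oracle admin*" part by hand, expanding the prefix 
against the terms actually in the index (again assuming a "keywords" field and 
the 1.3-era IndexReader/TermEnum API; no handling of an empty expansion):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.PhrasePrefixQuery;

public class OracleAdminQuery {
    // Builds a query roughly equivalent to the phrase-prefix "oracle admin*".
    public static PhrasePrefixQuery build(IndexReader reader) throws IOException {
        PhrasePrefixQuery query = new PhrasePrefixQuery();
        query.add(new Term("keywords", "oracle"));  // fixed first position
        List expansions = new ArrayList();           // admin, admins, administrator, ...
        TermEnum terms = reader.terms(new Term("keywords", "admin"));
        try {
            do {
                Term t = terms.term();
                if (t == null || !"keywords".equals(t.field())
                        || !t.text().startsWith("admin")) {
                    break;                            // past the "admin" prefix range
                }
                expansions.add(t);
            } while (terms.next());
        } finally {
            terms.close();
        }
        query.add((Term[]) expansions.toArray(new Term[expansions.size()]));
        return query;
    }
}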

-+ Tatu +-





Re: Wildcard searching - Case sensitiv?

2003-03-28 Thread Tatu Saloranta
On Friday 28 March 2003 08:37, [EMAIL PROTECTED] wrote:
> Ok, thanks Otis,
>
> you have to write the terms lowercase when you're searching with wildcards.

Or use the setter method in QueryParser to ask it to automatically lower-case
those terms. A patch for that was added before 1.3RC1 (check the javadocs or 
source for the exact method to call). I think the default was not to enable 
this feature, for backwards compatibility (unless Otis changed it as was 
suggested?).
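
Something along these lines, as a sketch only; the setter name here is from 
memory and should be checked against the 1.3RC1 javadocs as mentioned above, 
and "descr" is just an example field name:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class LowercaseWildcardDemo {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("descr", new StandardAnalyzer());
        // Ask the parser to lower-case wildcard/prefix terms itself, so that
        // "GAS*" matches the lower-cased terms StandardAnalyzer indexed.
        parser.setLowercaseWildcardTerms(true); // verify the exact name in the javadocs
        Query query = parser.parse("GAS*");
        System.out.println(query.toString("descr"));
    }
}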

-+ Tatu +-






Alternate Boolean Query Parser?

2003-03-28 Thread Shah, Vineel
One of my clients is asking for an old-style boolean query search on my keywords 
fields. A string might look like this:

"oracle admin*" and java and oracle and ("8.1.6" or "8.1.7") and ("solaris" or 
"unix" or "linux")

There would probably be a need for nested parentheses, although I can't think of an 
example. Is there a parser I can plug into Lucene to make this happen? It doesn't seem 
like the normal QueryParser class would like this string, or would it? Any ideas or 
comments would be appreciated. Making my own
grammar and parser class is too expensive a proposition.

Vineel Shah



Re: I: incremental index

2003-03-28 Thread Otis Gospodnetic
I believe it takes constant time to add a new document to an index,
because when adding a new document a new segment is created on the
disk, 'separate' from the other, existing index segments.
The size of the index may come into play when this new segment has to
be merged with the existing segments, which happens every mergeFactor
documents, so to speak.
I have built indices with several hundred thousand documents, but never
noticed an increase in the time to add a new document to the index.  Maybe
the difference was too small to notice.  I don't have sufficient
knowledge of Lucene to be able to stand behind this 100% and I could
certainly be wrong :(.
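
For what it's worth, a minimal sketch of the incremental addition I have in 
mind (1.3-era API; the index path and field name are made up): each 
addDocument() call writes a new small segment, and the larger merges only kick 
in every mergeFactor additions.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IncrementalAdd {
    public static void main(String[] args) throws Exception {
        // 'false' means: append to the existing index instead of re-creating it.
        IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
        writer.mergeFactor = 10;   // segments get merged every 10 added documents

        Document doc = new Document();
        doc.add(Field.Text("contents", "a freshly added document"));
        writer.addDocument(doc);   // written out as a new, small segment

        writer.close();            // note: no optimize() here
    }
}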

Otis


--- Leo Galambos <[EMAIL PROTECTED]> wrote:
> > Adding a new document does not immediately modify an index, so the
> time
> > it takes to add a new document to an existing index is not
> proportional
> > to the index size.  It is constant.  The execution time of
> optimize()
> > is proportional to the index size, so you want to do that only if
> you
> > really need it.  The Lucene article on http://www.onjava.com/ from
> > March 5th describes this in more detail.
> 
> Otis,
> 
> I am not sure that anything about constants is constant in non-constant
> IR systems :-)
> 
> I think that the correct answer is O((t/k) * (1 + log_m(k))), where t is
> the time you need to create and write one monolithic segment of k
> documents, m is the merge factor you use, and k is the number of
> documents which are already in the index. As you can see, the function
> grows with k.
> 
> Can you explain to me why the addition of one document takes constant
> time?
> 
> Thank you
> 
> -g-

> 





Re: I: incremental index

2003-03-28 Thread Leo Galambos
> Adding a new document does not immediately modify an index, so the time
> it takes to add a new document to an existing index is not proportional
> to the index size.  It is constant.  The execution time of optimize()
> is proportional to the index size, so you want to do that only if you
> really need it.  The Lucene article on http://www.onjava.com/ from
> March 5th describes this in more detail.

Otis,

I am not sure that anything about constants is constant in non-constant IR 
systems :-)

I think that the correct answer is O((t/k) * (1 + log_m(k))), where t is the time
you need to create and write one monolithic segment of k documents, m is the
merge factor you use, and k is the number of documents which are already
in the index. As you can see, the function grows with k.

Can you explain to me why the addition of one document takes constant time?

Thank you

-g-






RE: Tokenize negative number

2003-03-28 Thread Otis Gospodnetic
You are using an Analyzer that throws out non-alphanumeric characters,
StandardAnalyzer most likely.
You can create your own Analyzer to do exactly what you want.  A sample
Analyzer is in the Lucene FAQ at http://jguru.com/.
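
Just as an illustration (this is not the FAQ's sample), one way to keep the 
leading '-' is an Analyzer that splits on whitespace only instead of on 
punctuation:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Splits on whitespace only, so a token such as "-1234" keeps its leading
// minus sign; LowerCaseFilter keeps matching case-insensitive.
public class KeepMinusAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}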

Otis



--- Lixin Meng <[EMAIL PROTECTED]> wrote:
> Browsing through some of previous discussion, but I have to say that
> I
> couldn't find a solution for this. Would you mind provide more clue
> on this?
> 
> Regards,
> Lixin
> 
> -Original Message-
> From: Terry Steichen [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 25, 2003 7:14 PM
> To: Lucene Users List; [EMAIL PROTECTED]
> Subject: Re: Tokenize negative number
> 
> 
> Probably tokenized 1234 as a string and treated '-' as a separator. 
> See
> previous discussion on "query".
> 
> Regards,
> 
> Terry
> 
> - Original Message -
> From: "Lixin Meng" <[EMAIL PROTECTED]>
> To: "'Lucene Users List'" <[EMAIL PROTECTED]>
> Sent: Tuesday, March 25, 2003 9:16 PM
> Subject: Tokenize negative number
> 
> 
> > I have a document with content ' -1234 '. However, after
> calling
> the
> > StandardTokenizer, the token only has '1234' (missed the '-') as
> tokeText.
> >
> > Did anyone experience the similar problem and is there a work
> around?
> >
> > Regards,
> > Lixin
> >
> >
> >





RE: Tokenize negative number

2003-03-28 Thread Lixin Meng
I browsed through some of the previous discussion, but I have to say that I
couldn't find a solution for this. Would you mind providing more clues on this?

Regards,
Lixin

-Original Message-
From: Terry Steichen [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 25, 2003 7:14 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: Tokenize negative number


Probably tokenized 1234 as a string and treated '-' as a separator.  See
previous discussion on "query".

Regards,

Terry

- Original Message -
From: "Lixin Meng" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Tuesday, March 25, 2003 9:16 PM
Subject: Tokenize negative number


> I have a document with content ' -1234 '. However, after calling the
> StandardTokenizer, the token only has '1234' (missing the '-') as its term text.
>
> Did anyone experience a similar problem, and is there a workaround?
>
> Regards,
> Lixin
>
>



Re: I: incremental index

2003-03-28 Thread Otis Gospodnetic
Adding a new document does not immediately modify an index, so the time
it takes to add a new document to an existing index is not proportional
to the index size.  It is constant.  The execution time of optimize()
is proportional to the index size, so you want to do that only if you
really need it.  The Lucene article on http://www.onjava.com/ from
March 5th describes this in more detail.

Otis


--- "Rende Francesco, CS" <[EMAIL PROTECTED]> wrote:
> Hi,
> > I'm a lucene user and i found it a very interesting software.
> > 
> > My question is related to how manage incremental update of the
> lucene
> index.
> > In particular, adding more documents to a big index (~10 Gb) is the
> same
> of
> > creating a new segment and then merge the indexes?
> > Adding document to an existing and big index make the retrieve
> process
> slow?
> > 
> > Which is the best solution about the index performance:
> > 
> > a) create always a new index
> > 
> > b) create the index and then add more documents (with final
> optimize)
> > 
> > c) create the index then create a small segment for the new docs
> and then
> > merge the indexes.
> > 
> > Thanks in advance
> > 
> > F. Rende
> 
> 



Re: Wildcard searching - Case sensitiv?

2003-03-28 Thread Test2 . Schwab

Ok, thanks Otis,

you have to write the terms lowercase when you're searching with wildcards.




   

Otis Gospodnetic <...@yahoo.com>, 27.03.03 19:20
cc: (bcc: Test2 Schwab/MUC/VA/Linde-VA)
Subject: Re: Wildcard searching - Case sensitiv?
Please respond to "Lucene Users List"

See FAQ.

--- [EMAIL PROTECTED] wrote:
> Hi all,
>
> There is something I don't understand about the wildcard queries.
> I have values like 'REGENERATION GAS DISTRIBUTION' in the table.
> When I make a query like descr: Gas I receive 31 hits, and the same for the
> query descr:gas.
> But when I search for GAS* I don't receive anything,
> while for gas* I receive the same hits as for gas or GAS.
> I use StandardAnalyzer for both indexing and searching. I thought this
> analyzer makes all terms lowercase?
> What do I have to do to get the same hits for upper- and lower-case
> characters in the query?
>
> Thx for help
>
> Arsineh



I: incremental index

2003-03-28 Thread Rende Francesco, CS
Hi,
> I'm a Lucene user and I find it a very interesting piece of software.
> 
> My question is related to how to manage incremental updates of the Lucene
> index.
> In particular, is adding more documents to a big index (~10 GB) the same as
> creating a new segment and then merging the indexes?
> Does adding documents to an existing, big index make the retrieval process
> slow?
> 
> Which is the best solution for index performance:
> 
> a) always create a new index
> 
> b) create the index and then add more documents (with a final optimize)
> 
> c) create the index, then create a small segment for the new docs, and then
> merge the indexes.
> 
> Thanks in advance
> 
> F. Rende





RE: Very large queries?

2003-03-28 Thread Alex Murzaku
How about this:
assuming that your taxonomies are tree-like structures, you could expand
every term in the documents to be indexed with the path it belongs to in
the tree (i.e. all of its hypernyms and hyponyms); for this you use the
same technique as when using thesauri. This will allow you to enter only
one node of the taxonomy in the query, from any level, and get back all
the records/documents that contain it...
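
In case it makes the idea more concrete, here is a very rough sketch of that 
expansion at indexing time; the "contents" field name, the shape of the 
hypernym map, and the assumption that taxonomy terms are single words are all 
simplifications:

import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class TaxonomyExpander {
    private final Map hypernyms; // term -> List of ancestor concept labels

    public TaxonomyExpander(Map hypernyms) {
        this.hypernyms = hypernyms;
    }

    // Appends the taxonomy ancestors of every recognized term to the indexed
    // text, so a single concept query (e.g. "beta_blocker") matches any
    // document that mentions any drug of that class.
    public void addExpandedField(Document doc, String text) {
        StringBuffer expanded = new StringBuffer(text);
        StringTokenizer st = new StringTokenizer(text.toLowerCase());
        while (st.hasMoreTokens()) {
            List ancestors = (List) hypernyms.get(st.nextToken());
            if (ancestors != null) {
                for (Iterator it = ancestors.iterator(); it.hasNext();) {
                    expanded.append(' ').append(it.next());
                }
            }
        }
        doc.add(Field.Text("contents", expanded.toString()));
    }
}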


-- 
Alex Murzaku
___
 alex(at)lissus.com  http://www.lissus.com

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 28, 2003 6:49 AM
To: Lucene Users List
Subject: Re: Very large queries?


Thanks for these suggestions.  The idea of adding taxonomy-related
terms to the documents is an interesting one and bears some thought.
However, if I have to pre-process the corpus to determine which terms to
add, and then to add them, it would seem that I've already accomplished
my primary goal and don't need an indexer and search engine.  Remember:
this is not really an information retrieval application (with
document-level granularity) that is being contemplated here, but an
information extraction and text/data
mining application (with "fact-level" granularity).   My hope was to
leverage a search engine, guided by taxonomies, to accomplish this at
least as a first cut.

I do find Morus's suggestion to do an "inverse expansion" of terms in
the index at indexing time to be very intriguing as well.  Perhaps it is
also what was meant by Ype's suggestion about adding stuff to the
document (meaning adding stuff to the index).

It appears I will also need to handle my own identification of matched
terms.  Verity, too, supports term highlighting -- but I am not at all
certain they return information concerning the exact string that
triggered the highlighted match.  Perhaps if the "inverse expansion"
approach can be made to work, it would eliminate this need.  And it
might also eliminate the need for the very large queries.  The details
are unclear at this point, but the possibilities are interesting.

The suggestion of Jython is also appreciated and I was considering it
already.  I have not used Jython yet, but have developed all of my
ontology/taxonomy/dictionary/thesaurus translation tools in Python (and
yes, I do know the differences among all of these).  I've even started
to develop some of my interface stuff in Tkinter, but if I'm going to go
the Java route I'll probably abandon that in favor of Swing.

Well, I can see that I have a bit of work to do.  I do have an
undergraduate and a graduate student here at NC State working with me,
and perhaps I can squeeze some of this work out of them :-).

--
Gary H. Merrill
Director and Principal Scientist, New Applications
Data Exploration Sciences
GlaxoSmithKline Inc.
(919) 483-8456







Re: Very large queries?

2003-03-28 Thread gary.h.merrill
Thanks for these suggestions.  The idea of adding taxonomy-related terms
to the documents is an interesting one and bears some thought.  However, if
I have to pre-process the corpus to determine which terms to add, and then
to add them, it would seem that I've already accomplished my primary goal
and don't need an indexer and search engine.  Remember:  this is not really
an information retrieval application (with document-level granularity) that
is being contemplated here, but an information extraction and text/data
mining application (with "fact-level" granularity).   My hope was to
leverage a search engine, guided by taxonomies, to accomplish this at least
as a first cut.

I do find Morus's suggestion to do an "inverse expansion" of terms in the
index at indexing time to be very intriguing as well.  Perhaps it is also
what was meant by Ype's suggestion about adding stuff to the document
(meaning adding stuff to the index).

It appears I will also need to handle my own identification of matched
terms.  Verity, too, supports term highlighting -- but I am not at all
certain they return information concerning the exact string that triggered
the highlighted match.  Perhaps if the "inverse expansion" approach can be
made to work, it would eliminate this need.  And it might also eliminate
the need for the very large queries.  The details are unclear at this
point, but the possibilities are interesting.

The suggestion of Jython is also appreciated and I was considering it
already.  I have not used Jython yet, but have developed all of my
ontology/taxonomy/dictionary/thesaurus translation tools in Python (and
yes, I do know the differences among all of these).  I've even started to
develop some of my interface stuff in Tkinter, but if I'm going to go the
Java route I'll probably abandon that in favor of Swing.

Well, I can see that I have a bit of work to do.  I do have an
undergraduate and a graduate student here at NC State working with me, and
perhaps I can squeeze some of this work out of them :-).

--
Gary H. Merrill
Director and Principal Scientist, New Applications
Data Exploration Sciences
GlaxoSmithKline Inc.
(919) 483-8456







Re: Very large queries?

2003-03-28 Thread Ype Kingma
Gary,

On Thursday 27 March 2003 15:34, [EMAIL PROTECTED] wrote:
> Let me describe what the goal is and how I could use Verity to accompish
> this -- provided that Verity did not impose the limits it does.
>
> The documents being indexed are small, completely unstructured, textual
> descriptions of adverse events involving one or more drugs, one or more
> medical conditions, and potentially other relevant and irrelevant
> information.  By "small" I mean that they are typically on the order of
> several sentences in length.
>
> Assume that the initial goal is to identify pairwise associations of drugs
> and conditions in such documents.  Moreover, we would want not only to
> identify drug/condition pairs, but more broadly to identify
> type-of-drug/type-of-condition pairs.  For example, the set of adverse
> event reports might not contain a significant number of reports about a
> *specific* drug and a *specific* condition -- such as (just as an example)
> 'aspirin' and 'blood pressure' -- but it might contain a significant number
> of reports about a particular *class* of drugs (therapeutic class or
> pharmacological class) and a particular *class* of conditions -- such as
> 'beta-blockers' and 'cardiac events'.
>
> Viewed as an information retrieval problem (not the best way to view it,
> but this is just an initial approach), one could then (1) create a taxonomy
> of drugs and a taxonomy of conditions, and (2) implement a concept-oriented
> (taxonomy-oriented) search of the corpus for something like:
> {beta_blocker} AND {cardiac_condition}
> where '{beta_blocker}' expands via the taxonomy to the set of terms (words,
> sequences of words, etc.) that "fall under" that "concept" in the drug
> taxonomy and similarly for '{cardiac_condition}' under the condition
> taxonomy.
>
> A good search engine would then return (for any document in which the query
> is matched), the exact string(s) matching the query (e.g., 'thrombosis' or
> 'infarction' in the case of '{cardiac_condition}').  That is, from a very
> general query (phrased in terms of 'concepts' or 'categories'), you would
> get returned associations of specific terms and phrases.  Actually, Verity
> does this pretty nicely once you transform your taxonomy into a Verity
> topic set.

Lucene has no direct support for returning the actually matched terms.
The closest thing is the term highlighting work being done in the Lucene
contributions. IIRC they do it by retrieving each hit and string-searching
for the query terms in the original text.
If you only need the actually matched terms in a search, you'll probably
need to extend the search engine to visit the matched terms.
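
As an illustration of that string-searching approach (this is not the actual 
contributed highlighter), a crude sketch which assumes the original text was 
stored with each hit and the query terms are already lower-cased single words:

import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class MatchedTermFinder {
    // Reports which of the (lower-cased) query terms occur in a hit's stored text.
    public static Set findMatches(String storedText, Set queryTerms) {
        String lower = storedText.toLowerCase();
        Set found = new HashSet();
        for (Iterator it = queryTerms.iterator(); it.hasNext();) {
            String term = (String) it.next();
            if (lower.indexOf(term) >= 0) {
                found.add(term);
            }
        }
        return found;
    }
}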

> You can then take these specific associations you have identified
> (retrieved) and see what generalizations they fall under from the point of
> view of the taxonomy -- hoping to identify associations between classes of
> drugs and classes of conditions.  (How you do this, I omit here.)
>
> Ideally, your initial search should then simply be the most general one
> possible -- say of the form:
>
> {drug} AND {condition}
>
> (actually, probably not quite this simple; but you get the idea).  The
> problem is that '{drug}' will expand into (logically) a disjunctive term of
> 60,000 subterms, and '{condition}' will likewise expand into a disjunctive
> term of multiple thousands of terms.  Something logically equivalent to:
>
>   (drug_1 OR drug_2 ... OR drug_6) AND (condition_1 OR
> condition_2 OR ... condition_5000)
>
> Verity's implementation of their query constructor (it generates a machine
> to do the matching) imposes a limit of 1,024 child nodes of any single
> disjunctive node (roughly speaking) and a collective limit of (16,000/3)
> nodes for a topic.  Prior to hitting the limit, Verity does just swell.

AFAIK Lucene has no built-in limitations on query size. I have used queries
of several hundred terms (no AND operator) on an intermediate-size 
collection of documents (about 1 GB index size including stored text, about 
100,000 docs, i.e. 10 kB/doc including index size) and Lucene just did it, 
although quite a bit slower than for smaller queries.
The algorithms used in Lucene are geared towards smaller queries, however.
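
To make the size point concrete: for expansions this large you would typically 
build the query programmatically instead of going through QueryParser. A sketch 
with the 1.3-era BooleanQuery API, where the "contents" field name is just an 
assumption:

import java.util.Iterator;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ConceptQueryBuilder {
    // Builds (drug_1 OR drug_2 OR ...) AND (condition_1 OR condition_2 OR ...)
    public static Query build(List drugTerms, List conditionTerms) {
        BooleanQuery query = new BooleanQuery();
        query.add(disjunction(drugTerms), true, false);      // required clause
        query.add(disjunction(conditionTerms), true, false); // required clause
        return query;
    }

    private static Query disjunction(List terms) {
        BooleanQuery or = new BooleanQuery();
        for (Iterator it = terms.iterator(); it.hasNext();) {
            or.add(new TermQuery(new Term("contents", (String) it.next())),
                   false, false); // optional clause: plain OR
        }
        return or;
    }
}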

> So, with that much more context, the question can now be rephrased as to
> whether there is any problem with Lucene handling queries such as the one
> above where there are disjunctive sub-queries with that many terms.
>
> You can see, I think, that this has nothing to do with categorization (at
> least in the usual sense).  It is, in fact, an attempt to use a search
> engine to accomplish information extraction.  I was hoping to do this in
> order to get some quick and relatively easy results -- and I could if
> Verity didn't have these scaleability problems.  The one suggestion I have
> seen so far in the responses that seems relevant to the problem is the
> suggestion to transform the tax