Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Joshua O'Madadhain
t store sentence boundaries. Herb... -Original Message- From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED] Sent: Friday, November 14, 2003 1:14 PM To: Lucene Users List Subject: inter-term correlation [was Re: Vector Space Model in Lucene?] Incorporating inter-term correlation into L

inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Joshua O'Madadhain
know, and perhaps we can work something out. Regards, Joshua O'Madadhain On Friday, Nov 14, 2003, at 09:52 US/Pacific, Chong, Herb wrote: i don't know of any open source search engine that incorporates interterm correlation. i have been looking into how to do this in Lucene and so far

Re: Document Clustering

2003-11-11 Thread Joshua O'Madadhain
orithm may be run several times with different values, to determine the best value). Other types of algorithms, such as hierarchical agglomerative clustering algorithms, work more as you suggest. Regards, Joshua O'Madadhain [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden Jos

Re: Very large queries?

2003-03-27 Thread Joshua O'Madadhain
any idea of what the performance would be like > in retrieving via such queries? I do not have experience with such queries, so I can't speak to that question directly. However, I don't understand what the purpose of such a query would be in the first place. What are the documents

Re: information

2003-03-10 Thread Joshua O'Madadhain
er.cgi Joshua [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall It's that moment of dawning comprehension that I live for--Bill Watterson My opinions are too rational and insightfu

Re: Empty phrase search

2002-12-17 Thread Joshua O'Madadhain
work as a marker. As an extra layer of insurance, you could throw out any documents whose field only contained that string _as a substring_. This may not be completely bulletproof, but it's pretty close. :) Joshua

Re: Empty phrase search

2002-12-16 Thread Joshua O'Madadhain
) into a query containing "FieldA:emptyfield" automatically. This allows you to finesse the entire issue of adding something to Lucene--which may be for the best anyway, since this is really just a special case of looking for fields whose contents have a specific characteristic. Good luck-- Joshua
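A minimal sketch of the marker-token idea described above, in plain Python rather than Lucene. The marker string "emptyfield" and the field name "FieldA" come from the post; the function names `index_value` and `rewrite_query` are hypothetical, and the quoted-phrase output format is only an assumption about what the downstream query syntax looks like.

```python
EMPTY_MARKER = "emptyfield"  # marker token from the post; any token unlikely to occur in real text would do


def index_value(raw):
    """Return the text to index for a field, substituting the marker when the field is empty."""
    text = (raw or "").strip()
    return text if text else EMPTY_MARKER


def rewrite_query(field, phrase):
    """Rewrite a search for an empty phrase into a search for the marker token."""
    phrase = phrase.strip()
    if not phrase:
        return f"{field}:{EMPTY_MARKER}"
    # Non-empty phrases pass through as a quoted phrase query (assumed syntax).
    return f'{field}:"{phrase}"'


if __name__ == "__main__":
    print(index_value(""))              # emptyfield
    print(rewrite_query("FieldA", ""))  # FieldA:emptyfield
```

Because the substitution happens before indexing and before query parsing, nothing in the search engine itself has to change, which is the point the post makes.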

Re: Accentuated characters

2002-12-10 Thread Joshua O'Madadhain
ir arguments. After this is done, you would then do whatever Lucene processing (indexing, query parsing, etc.) was appropriate. I am not aware of any code that does this, but it should be straightforward. Good luck-- Joshua O'Madadhain
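The post is truncated, so the exact preprocessing it proposes is not visible here. Going by the thread subject, one plausible pre-Lucene step is to strip accents from both documents and queries. The sketch below assumes that reading; `strip_accents` is a hypothetical helper built on Python's standard `unicodedata` module, not something from the thread.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accent marks, leaving the base letters."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_accents("café"))  # cafe
```

Letters that have no base-plus-accent decomposition (for example "ø" or "ß") pass through unchanged, so a real deployment would likely need an explicit mapping table as well. The same function has to be applied to indexed text and to query text, or the two will not match.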

Re: Lucene Speed under diff JVMs

2002-12-05 Thread Joshua O'Madadhain
than another. The problem is compounded by the fact that it can be hard to tell just how much CPU is being taken up by OS tasks (and this can fluctuate quite a lot). If you really want to quote statistics like this, using 5 or 10 trials would give a more accurate notion of the real performance dif
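A rough sketch of the repeated-trials advice, assuming the workload can be wrapped in a callable. `time_trials` and `summarize` are hypothetical names, and the lambda in the usage lines is a stand-in workload, not a Lucene benchmark.

```python
import statistics
import time


def time_trials(task, trials=5):
    """Run task() several times and return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Report mean and spread so one noisy run does not dominate the comparison."""
    return {
        "mean": statistics.mean(durations),
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "min": min(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    runs = time_trials(lambda: sum(range(100_000)), trials=10)
    print(summarize(runs))
```

If the spread within one configuration is as large as the gap between two configurations, the measured difference probably reflects background OS activity rather than a real performance difference.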

Re: StandardFilter that works for French

2002-11-21 Thread Joshua O'Madadhain
re a number of contractions in English that could be affected if you're using the apostrophe as a marker, e.g.: isn't, wouldn't, I'd, he's, hasn't. (Granted, these are often considered stop words.) Thus, I think that your idea of incorporating this change into a French f

RE: Indexing synonyms

2002-11-11 Thread Joshua O'Madadhain
r 'spicy', or 'attractive', or ...) is not nearly as strong as the connections going the other direction. You also can get problems with homonyms like 'minute' (time period) and 'minute' (very small); clearly these two demand different classes of related terms.

Re: Indexing synonyms

2002-11-10 Thread Joshua O'Madadhain
, and add them to the query. I would guess that this would be fast enough for your purposes, is more flexible (in case you want to expand or contract your notion of a synonym), and requires no additional index space. If you have something else in mind, what is it? Regards, Joshua O'Madadhain
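A small sketch of query-time synonym expansion as described above, assuming the synonyms live in a lookup table outside the index. The table contents, the name `expand_query`, and the parenthesized OR-group output are all illustrative assumptions; the output is meant to resemble query-parser syntax but has not been checked against any particular Lucene version.

```python
# Hypothetical synonym table; in practice this might come from a thesaurus file.
SYNONYMS = {
    "car": ["automobile", "auto"],
    "fast": ["quick"],
}


def expand_query(terms, synonyms=SYNONYMS):
    """Replace each term that has synonyms with an OR-group of the term and its synonyms."""
    expanded = []
    for term in terms:
        group = [term] + [s for s in synonyms.get(term, []) if s != term]
        if len(group) == 1:
            expanded.append(term)  # no synonyms known: leave the term as-is
        else:
            expanded.append("(" + " OR ".join(group) + ")")
    return " ".join(expanded)


if __name__ == "__main__":
    print(expand_query(["fast", "car", "rental"]))
    # (fast OR quick) (car OR automobile OR auto) rental
```

Since the table is consulted only when a query is built, widening or narrowing the notion of a synonym means editing the table, with no reindexing and no extra index space.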

Re: definite matching

2002-10-23 Thread Joshua O'Madadhain
rld > > will find all documents with the term hello and not world. > Note: You cannot use the - option alone. > > Also you can use NOT in the same way > > hello NOT world > > results in > > hello -world > > > Finally the OR operator (the current default) op

Re: Tags Screwing up Searches

2002-10-21 Thread Joshua O'Madadhain
hen return the (cleaned-up) HTML later when asked for? The basis of any 'semantic' tags that you might be putting in the XML (perhaps to define Lucene fields) must be there in the HTML anyway, so I'm not sure what the DOM and XML representations get you. Regards, Joshua O'

Re: Tags Screwing up Searches

2002-10-21 Thread Joshua O'Madadhain
nx can 'dump' the text from a web page out as follows: cat foo.html | lynx -dump -nolist > foo.txt This effectively strips the HTML tags out of foo.html and writes the text of the page to the file foo.txt. Once you've done this, of course, you can use the same analyzers th
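For cases where lynx is not available, a rough stand-in for the tag-stripping step can be sketched with Python's standard `html.parser` module. This is not equivalent to `lynx -dump`: it does no layout and simply joins the text nodes with spaces. `TextExtractor` and `html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML page, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(html_to_text("<h1>Title</h1><p>Hello <b>world</b></p>"))  # Title Hello world
```

The resulting plain text can then be handed to whatever analyzer is used for ordinary text files.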

Re: Query modifier ?

2002-09-27 Thread Joshua O'Madadhain
ll depend on what kind of query you want to do, and whether you want to allow the user to specify Boolean modifiers, term boosts, etc. It may be possible to use the standard QueryParser to parse the query and then hack the Query that is returned, but I've never tried it. Good luck-- Joshua O

Re: Comparing Intermedia and Lucene

2002-09-25 Thread Joshua O'Madadhain

Re: Comparing Intermedia and Lucene

2002-09-25 Thread Joshua O'Madadhain
that there is no specific API for term expansion in Lucene, that's true, but I'm not sure how much value such an API would add to Lucene. Joshua O'Madadhain

Re: delete / optimize question

2002-09-17 Thread Joshua O'Madadhain
ou call docFreq()? (close() does flush changes, although I don't know whether it should be necessary after optimize().) Anyway, good luck. Joshua

RE: GoogleQueryParser

2002-09-12 Thread Joshua O'Madadhain
ally, I avoid using the QueryParser entirely and just do my own parsing and query construction. Part of the reason for this is that my code is doing term expansion and reweighting, but part of it is just that I feel that I get more power and flexibility--and less opportunity for ambiguity such as

Re: Quotes in keyword field searches

2002-09-10 Thread Joshua O'Madadhain
ike your document is "hello world" and your query string is "goodbye everyone". Under those circumstances (no overlap of index and query) I'd expect 0 hits. Good luck-- Joshua O'Madadhain

Re: Newbie quizzes further...

2002-09-02 Thread Joshua O'Madadhain
other hand, if you're talking about accents and non-English letters, I understand that some people have written analyzers that cover these things; check out the contrib section on the Lucene website.) Joshua O'Madadhain

Re: text format and scoring

2002-08-02 Thread Joshua O'Madadhain
changing document scores (on the back end, with respect to a particular query) as it is of changing the weighting of terms (on the front end). I've just glanced through the API and I don't see a way to do term boosting during indexing, but maybe there's something I've missed. Anyo

Re: Wrong spelling

2002-07-24 Thread Joshua O'Madadhain
ring edit distance calculator/data structure, but I don't have any quick answers as to how to do that. Good luck-- Joshua O'Madadhain
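The post mentions a string edit distance calculator without saying which distance or data structure it has in mind. As one possible reading, here is a sketch of the classic Levenshtein distance with a brute-force scan over a vocabulary. Both `edit_distance` and `closest_terms` are hypothetical names, and the linear scan is an assumption that would be too slow for a large term dictionary.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def closest_terms(word, vocabulary, max_distance=2):
    """Return vocabulary terms within max_distance edits of word, nearest first."""
    scored = sorted((edit_distance(word, term), term) for term in vocabulary)
    return [term for distance, term in scored if distance <= max_distance]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))                      # 3
    print(closest_terms("lucine", ["lucene", "lucent", "apache"]))  # ['lucene', 'lucent']
```

A misspelled query term could be replaced by, or expanded with, the nearest indexed terms found this way.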

RE: contains

2002-07-22 Thread Joshua O'Madadhain
t _creating_ such an index would be extremely time-consuming even with clever data structures, and consider how much extra storage for pointers would be necessary for entries like "e" or "n".' In any case, I personally would consider the expected overhead of space to be prohib

Re: CachedSearcher

2002-07-16 Thread Joshua O'Madadhain
one package?) If nothing else, such inclusion might be somewhat mysterious to later maintainers of that code. This kind of modification might also make it more difficult for people to get Lucene contributions from more than one source to work together. Regards, Joshua O'Madadha

RE: contains

2002-07-12 Thread Joshua O'Madadhain
and find "beautiful". If you did, the number of entries would then be multiplied by a factor of the _square_ of the average number of characters per word. (You might be able to avoid this by doing prefix and suffix searches--which are difficult but less so--on the strings you specify, t
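To make the growth estimate concrete: a word of n characters has n(n+1)/2 contiguous non-empty substrings, which grows with the square of the word length. The sketch below just enumerates them. The function names are hypothetical, and the query fragment "tif" is an illustrative choice, not one taken from the thread.

```python
def substrings(word):
    """List every contiguous non-empty substring of word, one per (start, end) position."""
    return [word[i:j] for i in range(len(word)) for j in range(i + 1, len(word) + 1)]


def substring_count(length):
    """Number of (start, end) substring positions in a word of the given length."""
    return length * (length + 1) // 2


if __name__ == "__main__":
    entries = substrings("beautiful")
    print(len(entries), substring_count(9))  # 45 45
    print("tif" in entries)                  # True
```

One nine-letter term therefore becomes 45 index entries (somewhat fewer once duplicates are merged), which is the kind of blow-up the post warns about.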

Re: contains

2002-07-10 Thread Joshua O'Madadhain
hink about how it might be used in practice before you spend a lot of time implementing it. Regards, Joshua O'Madadhain

RE: LIFO or FIFO??

2002-07-08 Thread Joshua O'Madadhain
On Mon, 8 Jul 2002, Samir Satam wrote:

Re: Combining queries using OR

2002-06-09 Thread Joshua O'Madadhain
make more sense once you see the interface). Regards, Joshua O'Madadhain

Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)

2002-04-29 Thread Joshua O'Madadhain
On Mon, 29 Apr 2002, petite_abeille wrote: > As a final note, several people suggested to increase the number of > file descriptors per process with something like "ulimit"... From what > I learned today, I think it's a *bad* idea to have to change some > system parameters just because your/my ap

Re: Normalization of Documents

2002-04-16 Thread Joshua O'Madadhain
from Bernhard Messer: > > Let me know if you find that idea interesting, i would like to work on > > that topic. Yup, me too. This is germane to my research as well. Joshua

RE: Case Sensitivity

2002-04-03 Thread Joshua O'Madadhain
Alan, Aruna: The built-in solution is to use LowerCaseFilter in your Analyzer. (The SimpleAnalyzer, StopAnalyzer, and StandardAnalyzer classes already do this; see the Lucene API docs to see which filters each uses.) The FAQ includes an example implementation of an Analyzer if you want to build
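The effect of lower-casing in the analyzer can be illustrated without Lucene. The sketch below is a toy inverted index in which the same `analyze` function is applied to documents and to queries. All names are hypothetical, and this is not how LowerCaseFilter is implemented; it only shows why case stops mattering once both sides are normalized.

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer: split on whitespace and lower-case every token."""
    return [token.lower() for token in text.split()]


def build_index(docs):
    """Map each lower-cased token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query token, ignoring case."""
    results = None
    for token in analyze(query):
        hits = index.get(token, set())
        results = hits if results is None else results & hits
    return results or set()


if __name__ == "__main__":
    idx = build_index({1: "Hello World", 2: "hello there"})
    print(sorted(search(idx, "HELLO")))  # [1, 2]
```

If only one side is lower-cased, for example the index but not the query, a query for "HELLO" matches nothing, which is the usual cause of case-sensitivity surprises.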

Re: Relevance Feedback

2002-03-29 Thread Joshua O'Madadhain
nks > > --Peter > > On 3/29/02 12:11 PM, "Joshua O'Madadhain" <[EMAIL PROTECTED]> wrote: > > > > > While I did weight documents based on the terms they used (take a look at > > TermQuery.setBoost()), I didn't do relevance feedback per se. Of

Re: Relevance Feedback

2002-03-29 Thread Joshua O'Madadhain
On Fri, 29 Mar 2002, Nathan G. Freier wrote: > I'm a graduate student in the Information School at the University of > Washington. I'm currently in the process of developing a prototype > online IR system and I have been making use of Lucene's API. I'm just > beginning to plan out some mecha

Re: Need pointers on using a very small part of Lucene

2002-03-14 Thread Joshua O'Madadhain
> 'deploi', 'deploying' to 'deploy', etc. You want the PorterStemFilter (what you're talking about is 'stemming', and the Porter stemmer is a specific popular instance of such). See the Lucene FAQ section 2 #23 for info on Porter stemming, and #17

Re: What type of indexer is Lucene? Question reworded.

2002-03-07 Thread Joshua O'Madadhain
Melissa: These questions are answered in the Lucene FAQ, which is located at http://www.lucene.com/cgi-bin/faq/faqmanager.cgi However, if I correctly understand your fundamental question, my understanding is that Lucene basically uses the vector model of IR. Joshua

RE: Googlifying lucene querys

2002-02-25 Thread Joshua O'Madadhain
On Mon, 25 Feb 2002, Doug Cutting wrote: > > From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]] > > > > You cannot, in general, structure a Lucene query such that it > > will yield > > the same document rankings that Google would for that (query, documen

RE: Googlifying lucene querys

2002-02-25 Thread Joshua O'Madadhain
You cannot, in general, structure a Lucene query such that it will yield the same document rankings that Google would for that (query, document set). The reason for this is that Google employs a scoring algorithm that includes information about the topology of the pages (i.e., how the pages are l

Re: Lucene Query Structure

2002-02-19 Thread Joshua O'Madadhain
Actually, Winton's suggestion doesn't work because it's inconsistent with the syntax of BooleanQuery() (the constructor doesn't take arguments, and add() takes one Query argument, not two). After considerable study of the documentation, I am still confused about the semantics of BooleanQuery. I

Qs re: document scoring and semantics

2002-02-16 Thread Joshua O'Madadhain
e question (2) above). Could someone please explain what MultiTermQuery is for, how it should be used, etc.? Thanks-- Joshua O'Madadhain

number of terms vs. number of fields

2001-12-01 Thread Joshua O'Madadhain
I have been experimenting with indexing a document set with different sets of fields. Specifically, I start out with a "contents" field that is a concatenation of all the elements of the original document in which I'm interested. This gets me an index with about 7500 unique terms (which I determ

multiple-term queries and term numbers

2001-12-01 Thread Joshua O'Madadhain
kup tables by arrays of arrays, represent them as hash tables (keyed by some munging of the string which represents the term) of arrays. Thanks in advance for any assistance. Regards, Joshua O'Madadhain

Re: compiling example code

2001-11-17 Thread Joshua O'Madadhain
[I'm taking the liberty of redirecting part of a conversation on Lucene that I took off-list back on-list, since I think it's become generally relevant.] On Fri, 16 Nov 2001, Steven J. Owens wrote: > > ...I still think it's easier for a project > > consisting of three files to just compile the d

Re: Sorting Options for Query Results

2001-11-16 Thread Joshua O'Madadhain
On Fri, 16 Nov 2001, Jeff Kunkle wrote: > Hello. Does anyone know of a way to sort search results other than by > score? It seems like it would be very useful to be able to sort by > date or maybe even by any field that has been indexed (which I guess > would include a date). From what I can t

extracting information from an index

2001-11-15 Thread Joshua O'Madadhain
ource to start hacking on would also be appreciated. Thanks in advance for any help that may be offered. Regards, Joshua O'Madadhain (Madden)

RE: compiling example code

2001-11-13 Thread Joshua O'Madadhain
On Tue, 13 Nov 2001, Alex Murzaku wrote: > Are you using ant? By just using "ant demo" from the lucene root > directory everything goes fine. Make sure you have the latest ant > (1.4). I have no idea what ant is or what it's supposed to be for; I'll do a web search to see if I can get one (an id

compiling example code

2001-11-13 Thread Joshua O'Madadhain
I am attempting to build the example code which is located in the \lucene-1.2-rc2\src\demo\org\apache\lucene directory in the distribution. Specifically, I'm trying to get IndexFiles.java, SearchFiles.java, and FileDocument.java to compile, as a sanity check, before trying to go any further.