t store sentence boundaries.
Herb...
-----Original Message-----
From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 14, 2003 1:14 PM
To: Lucene Users List
Subject: inter-term correlation [was Re: Vector Space Model in Lucene?]
Incorporating inter-term correlation into L
know, and perhaps we can work something
out.
Regards,
Joshua O'Madadhain
On Friday, Nov 14, 2003, at 09:52 US/Pacific, Chong, Herb wrote:
i don't know of any open source search engine that incorporates
interterm correlation. i have been looking into how to do this in
Lucene and so far
orithm may be run several times with different values, to determine
the best value). Other types of algorithms, such as hierarchical
agglomerative clustering algorithms, work more as you suggest.
Regards,
Joshua O'Madadhain
[EMAIL PROTECTED] Per
Obscurius...www.ics.uci.edu/~jmadden
Jos
any idea of what the performance would be like
> in retrieving via such queries?
I do not have experience with such queries, so I can't speak to that
question directly. However, I don't understand what the purpose of such a
query would be in the first place. What are the documents
er.cgi
Joshua
[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightfu
work as a marker.
As an extra layer of insurance, you could throw out any documents whose
field only contained that string _as a substring_.
This may not be completely bulletproof, but it's pretty close. :)
Joshua
[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O
) into a query containing
"FieldA:emptyfield" automatically.
This allows you to finesse the entire issue of adding something to
Lucene--which may be for the best anyway, since this is really just a
special case of looking for fields whose contents have a specific
characteristic.
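A rough sketch of the trick (the sentinel value and helper names here are made up for illustration): at index time, substitute a marker token when the field is blank, and rewrite a request for "documents with an empty FieldA" into an ordinary term query string.

```java
// Sketch of the empty-field sentinel trick. EMPTY_MARKER, valueToIndex,
// and rewriteEmptyFieldQuery are invented names for this illustration.
public class EmptyFieldSketch {
    static final String EMPTY_MARKER = "emptyfield";

    // Value to index: substitute the sentinel when the field is blank.
    static String valueToIndex(String fieldValue) {
        return (fieldValue == null || fieldValue.trim().isEmpty())
                ? EMPTY_MARKER : fieldValue;
    }

    // Rewrite a request for an empty field into a normal query string.
    static String rewriteEmptyFieldQuery(String fieldName) {
        return fieldName + ":" + EMPTY_MARKER;
    }

    public static void main(String[] args) {
        System.out.println(valueToIndex("   "));              // emptyfield
        System.out.println(rewriteEmptyFieldQuery("FieldA")); // FieldA:emptyfield
    }
}
```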
Good luck--
Joshua
ir arguments. (After
this is done, you would then do whatever Lucene processing (indexing,
query parsing, etc.) was appropriate.) I am not aware of any code that
does this, but it should be straightforward.
Good luck--
Joshua O'Madadhain
[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jma
than another. The problem is
compounded by the fact that it can be hard to tell just how much CPU is
being taken up by OS tasks (and this can fluctuate quite a lot). If you
really want to quote statistics like this, using 5 or 10 trials would give
a more accurate notion of the real performance dif
re a number of contractions in English that could be affected if
you're using the apostrophe as a marker, e.g.: isn't, wouldn't, I'd, he's,
hasn't. (Granted, these are often considered stop words.) Thus, I think
that your idea of incorporating this change into a French f
'spicy', or 'attractive', or ...) is not nearly as strong as the
connections going the other direction. You also can get problems with
homonyms like 'minute' (time period) and 'minute' (very small); clearly
these two demand different classes of related terms.
, and add them to the query. I would guess that this would be fast
enough for your purposes, is more flexible (in case you want to expand or
contract your notion of a synonym), and requires no additional index
space.
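To sketch what I mean (the synonym table and method names here are made up): look each query term up in a table, and OR any synonyms into the query string before handing it to Lucene.

```java
import java.util.*;

// Sketch of query-time synonym expansion: each term found in the table
// is replaced by an OR group of the term plus its synonyms. The table
// contents and names are illustrative only.
public class SynonymExpansion {
    static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("car", Arrays.asList("automobile", "auto"));
    }

    static String expand(String query) {
        StringBuilder sb = new StringBuilder();
        for (String term : query.split("\\s+")) {
            if (sb.length() > 0) sb.append(" ");
            List<String> syns = SYNONYMS.get(term);
            if (syns == null) {
                sb.append(term);
            } else {
                sb.append("(").append(term);
                for (String s : syns) sb.append(" OR ").append(s);
                sb.append(")");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("red car"));
        // red (car OR automobile OR auto)
    }
}
```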
If you have something else in mind, what is it?
Regards,
Joshua O'Madadhain
rld
>
> will find all documents with the term hello and not world.
> Note: You cannot use the - option alone.
>
> Also you can use NOT in the same way
>
> hello NOT world
>
> results in
>
> hello -world
>
>
> Finally the OR operator (the current default) op
hen return the
(cleaned-up) HTML later when asked for? The basis of any 'semantic' tags
that you might be putting in the XML (perhaps to define Lucene fields)
must be there in the HTML anyway, so I'm not sure what the DOM and XML
representations get you.
Regards,
Joshua O'
nx can 'dump' the text from a web
page out as follows:
cat foo.html | lynx -dump -nolist > foo.txt
This effectively strips the HTML tags out of foo.html and writes the text
of the page to the file foo.txt.
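If you'd rather stay inside Java, a very crude approximation of the same step (a sketch only; regexes cannot handle all HTML, and lynx does a far better job) looks like this:

```java
// Rough pure-Java tag stripping: drop anything in <...> and collapse
// whitespace. Illustrative only; not a substitute for a real HTML parser.
public class TagStripper {
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ")   // drop anything in <...>
                   .replaceAll("\\s+", " ")      // collapse whitespace runs
                   .trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello, <b>world</b>!</p>"));
        // Hello, world !
    }
}
```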
Once you've done this, of course, you can use the same analyzers th
ll depend on what kind of query you
want to do, and whether you want to allow the user to specify Boolean
modifiers, term boosts, etc.
It may be possible to use the standard QueryParser to parse the query and
then hack the Query that is returned, but I've never tried it.
Good luck--
Joshua O
..www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
that there is no specific API for term expansion in
Lucene, that's true, but I'm not sure how much value such an API would add
to Lucene.
Joshua O'Madadhain
[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosop
ou call docFreq()? (close() does flush changes, although I don't know
whether it should be necessary after optimize().)
Anyway, good luck.
Joshua
[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It
ally, I avoid using the QueryParser entirely and just do my own
parsing and query construction. Part of the reason for this is that my
code is doing term expansion and reweighting, but part of it is just that
I feel that I get more power and flexibility--and less opportunity for
ambiguity such as
ike your document is "hello
world" and your query string is "goodbye everyone". Under those
circumstances (no overlap of index and query) I'd expect 0 hits.
Good luck--
Joshua O'Madadhain
[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O
other hand, if you're
talking about accents and non-English letters, I understand that some
people have written analyzers that cover these things; check out the
contrib section on the Lucene website.)
Joshua O'Madadhain
[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua
changing document
scores (on the back end, with respect to a particular query) as it is of
changing the weighting of terms (on the front end).
I've just glanced through the API and I don't see a way to do term
boosting during indexing, but maybe there's something I've missed.
Anyo
ring edit distance calculator/data
structure, but I don't have any quick answers as to how to do that.
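The calculator part, at least, is standard dynamic programming; here is a minimal Levenshtein implementation. The hard part, which I'm not showing, is wrapping it in an index-side data structure (e.g., a trie or BK-tree) so you don't compare against every term.

```java
// Minimal Levenshtein edit distance via the classic DP table:
// d[i][j] = edits needed to turn the first i chars of a into
// the first j chars of b.
public class EditDistance {
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,     // deletion
                                            d[i][j - 1] + 1),    // insertion
                                   d[i - 1][j - 1] + cost);      // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```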
Good luck--
Joshua O'Madadhain
[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that m
t _creating_ such an index would be extremely
time-consuming even with clever data structures, and consider how much
extra storage for pointers would be necessary for entries like "e" or
"n".
In any case, I personally would consider the expected overhead of space to
be prohib
one package?) If nothing else, such inclusion
might be somewhat mysterious to later maintainers of that code. This kind
of modification might also make it more difficult for people to get Lucene
contributions from more than one source to work together.
Regards,
Joshua O'Madadha
and find "beautiful". If you did, the number of
entries would then be multiplied by a factor of the _square_ of the
average number of characters per word. (You might be able to avoid this
by doing prefix and suffix searches--which are difficult but less so--on
the strings you specify, t
hink about how it might be used in practice
before you spend a lot of time implementing it.
Regards,
Joshua O'Madadhain
[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning
ww.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
On Mon, 8 Jul 2002, Samir Satam wrote:
make more sense once you see the interface).
Regards,
Joshua O'Madadhain
[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for. -- Bill Watte
On Mon, 29 Apr 2002, petite_abeille wrote:
> As a final note, several people suggested to increase the number of
> file descriptors per process with something like "ulimit"... From what
> I learned today, I think it's a *bad* idea to have to change some
> system parameters just because your/my ap
from Bernhard Messer:
> > Let me know if you find that idea interesting, I would like to work on
> > that topic.
Yup, me too. This is germane to my research as well.
Joshua
[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philoso
Alan, Aruna:
The built-in solution is to use LowerCaseFilter in your Analyzer. (The
SimpleAnalyzer, StopAnalyzer, and StandardAnalyzer classes already do
this; see the Lucene API docs to see which filters each uses.) The FAQ
includes an example implementation of an Analyzer if you want to build
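To see the effect of lowercasing at the token level, here is a toy stand-in (not the actual Lucene filter chain, just an illustration of what a letter tokenizer plus LowerCaseFilter does): split on non-letters, then lowercase every token.

```java
import java.util.*;

// Toy illustration of lowercase token normalization: roughly what
// SimpleAnalyzer produces. Not Lucene code; for illustration only.
public class LowerCaseTokens {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^A-Za-z]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The QUICK Brown-Fox!"));
        // [the, quick, brown, fox]
    }
}
```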
nks
>
> --Peter
>
> On 3/29/02 12:11 PM, "Joshua O'Madadhain" <[EMAIL PROTECTED]> wrote:
>
> >
> > While I did weight documents based on the terms they used (take a look at
> > TermQuery.setBoost()), I didn't do relevance feedback per se. Of
On Fri, 29 Mar 2002, Nathan G. Freier wrote:
> I'm a graduate student in the Information School at the University of
> Washington. I'm currently in the process of developing a prototype
> online IR system and I have been making use of Lucene's API. I'm just
> beginning to plan out some mecha
> 'deploi', 'deploying' to 'deploy', etc.
You want the PorterStemFilter (what you're talking about is 'stemming',
and the Porter stemmer is a specific popular instance of such). See the
Lucene FAQ section 2 #23 for info on Porter stemming, and #17
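To give a feel for what a stemmer does, here is a deliberately tiny suffix-stripping sketch. This is NOT the Porter algorithm (which has many more rules and conditions); for real use, PorterStemFilter is the way to go.

```java
// Tiny suffix-stripping sketch, for illustration only. The length checks
// are crude guards against mangling short words; a real stemmer like
// Porter's is far more careful.
public class TinyStemmer {
    static String stem(String word) {
        if (word.endsWith("ing") && word.length() > 5) {
            return word.substring(0, word.length() - 3);
        }
        if (word.endsWith("ed") && word.length() > 4) {
            return word.substring(0, word.length() - 2);
        }
        if (word.endsWith("s") && !word.endsWith("ss") && word.length() > 3) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("deploying")); // deploy
        System.out.println(stem("cats"));      // cat
    }
}
```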
Melissa:
These questions are answered in the Lucene FAQ, which is located at
http://www.lucene.com/cgi-bin/faq/faqmanager.cgi
However, if I correctly understand your fundamental question, my
understanding is that Lucene basically uses the vector model of IR.
Joshua
[EMAIL PROTECTED] Per Obs
On Mon, 25 Feb 2002, Doug Cutting wrote:
> > From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]]
> >
> > You cannot, in general, structure a Lucene query such that it
> > will yield
> > the same document rankings that Google would for that (query, documen
You cannot, in general, structure a Lucene query such that it will yield
the same document rankings that Google would for that (query, document
set). The reason for this is that Google employs a scoring algorithm that
includes information about the topology of the pages (i.e., how the
pages are l
Actually, Winton's suggestion doesn't work because it's inconsistent with
the syntax of BooleanQuery() (the constructor doesn't take arguments, and
add() takes one Query argument, not two).
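To pin down my reading of the (required, prohibited) flags, here is the behavior sketched over plain sets of document IDs (an illustration of the intended semantics, not Lucene code): required clauses intersect, prohibited clauses subtract, and purely optional clauses only add to the result.

```java
import java.util.*;

// Sketch of boolean-clause semantics over document-ID sets.
// Illustrative only; the real BooleanQuery also handles scoring.
public class BooleanSemantics {
    static Set<Integer> combine(List<Set<Integer>> required,
                                List<Set<Integer>> prohibited,
                                List<Set<Integer>> optional) {
        Set<Integer> result = new HashSet<>();
        if (required.isEmpty()) {
            for (Set<Integer> s : optional) result.addAll(s);  // union
        } else {
            result.addAll(required.get(0));
            for (Set<Integer> s : required) result.retainAll(s); // intersect
        }
        for (Set<Integer> s : prohibited) result.removeAll(s);   // subtract
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> hello = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> world = new HashSet<>(Arrays.asList(2, 3, 4));
        // +hello -world  ->  [1]
        System.out.println(combine(Arrays.asList(hello),
                                   Arrays.asList(world),
                                   Collections.emptyList()));
    }
}
```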
After considerable study of the documentation, I am still confused about
the semantics of BooleanQuery. I
e question
(2) above). Could someone please explain what MultiTermQuery is for,
how it should be used, etc.?
Thanks--
Joshua O'Madadhain
[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's th
I have been experimenting with indexing a document set with different sets
of fields. Specifically, I start out with a "contents" field that
is a concatenation of all the elements of the original document in which
I'm interested. This gets me an index with about 7500 unique terms (which
I determ
kup tables by arrays of arrays,
represent them as hash tables (keyed by some munging of the string
which represents the term) of arrays.
Thanks in advance for any assistance.
Regards,
Joshua O'Madadhain
[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Inform
[I'm taking the liberty of redirecting part of a conversation on Lucene
that I took off-list back on-list, since I think it's become generally
relevant.]
On Fri, 16 Nov 2001, Steven J. Owens wrote:
> > ...I still think it's easier for a project
> > consisting of three files to just compile the d
On Fri, 16 Nov 2001, Jeff Kunkle wrote:
> Hello. Does anyone know of a way to sort search results other than by
> score? It seems like it would be very useful to be able to sort by
> date or maybe even by any field that has been indexed (which I guess
> would include a date). From what I can t
ource to start hacking on would also be
appreciated.
Thanks in advance for any help that may be offered.
Regards,
Joshua O'Madadhain (Madden)
[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that
On Tue, 13 Nov 2001, Alex Murzaku wrote:
> Are you using ant? By just using "ant demo" from the lucene root
> directory everything goes fine. Make sure you have the latest ant
> (1.4).
I have no idea what ant is or what it's supposed to be for; I'll do a web
search to see if I can get one (an id
I am attempting to build the example code which is located in the
\lucene-1.2-rc2\src\demo\org\apache\lucene
directory in the distribution. Specifically, I'm trying to get
IndexFiles.java, SearchFiles.java, and FileDocument.java to compile, as a
sanity check, before trying to go any further.