Herb,
On Friday 14 November 2003 13:39, Chong, Herb wrote:
> you're describing ad-hoc solutions to a problem that have an effect, but
> not one that is easily predictable. one can concoct all sorts of
> combinations of the query operators that would have something of the effect
> that i am describing. cr
On Friday 14 November 2003 11:50, Chong, Herb wrote:
> if you are handling inter correlation properly, then terms can't cross
> sentence boundaries. if you are not paying attention to sentence
> boundaries, then you are not following rules of linguistics.
Isn't that quite a strict interpretation, ho
Hi,
We're seeing slow response times when we apply a DateFilter. A search that
takes 7 msec with no date filter takes 368 msec when I filter on the last
fifteen days, and 632 msec on the last 30 days.
Initially we saved doing
document.add(Field.Keyword("dtstamp", dtstamp));
and then changed to doin
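One common cause of that kind of slowdown with the 1.3-era API is that Field.Keyword("dtstamp", someDate) stores the date at millisecond granularity, so a DateFilter or range query has to walk a very large number of unique terms. A minimal sketch of the usual workaround, indexing at day granularity and querying with a RangeQuery; the class name, field name and 15-day computation are only illustrative:

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.RangeQuery;

public class DayGranularityDates {

    private static final SimpleDateFormat DAY = new SimpleDateFormat("yyyyMMdd");

    // Index the timestamp at day granularity: at most one unique term per day
    // instead of one per millisecond.
    public static Field dtstampField(Date d) {
        return Field.Keyword("dtstamp", DAY.format(d));
    }

    // Range over the last n days, expressed in the same yyyyMMdd format.
    public static RangeQuery lastNDays(int n) {
        Date now = new Date();
        Date from = new Date(now.getTime() - n * 24L * 60L * 60L * 1000L);
        Term lower = new Term("dtstamp", DAY.format(from));
        Term upper = new Term("dtstamp", DAY.format(now));
        return new RangeQuery(lower, upper, true); // inclusive on both ends
    }
}

Because yyyyMMdd strings sort lexicographically in date order, the range stays correct while the number of distinct date terms drops to one per day.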
Hello Herb,
I don't approve of several teasing, mean, etc. emails I saw from a few
people. This is a serious and polite email. :)
It sounds like you know about NLP and see places where Lucene could be
improved. Lucene is open source and free, and could benefit from
knowledgeable people like you
Well ... Sure, nothing can replace a human mind. But believe it or not,
there are studies which show that even human experts can significantly
differ in their opinions on what the key-phrases for a given text are. So,
the results are never clear cut with humans either...
So, in this sense a heurist
PA,
But Lucene is a low-level indexing library.
I'm sure most people here will agree that Lucene is much more than a
_low level_ indexing library.
Maybe it is just a library, but it is definitely the *highest level* search
technology available on the web for free.
You ride roughshod over the hard
On Nov 14, 2003, at 21:14, Philippe Laflamme wrote:
Rules of linguistics? Is there such a thing? :)
Actually, yes there is. Natural Language Processing (NLP) is a very broad
research subject but a lot has come out of it.
A lot of what? "If" statements? :)
Yes... just like all software boils down
Leo Galambos wrote:
There are other (more trivial) problems as well. One geek from UFAL (our
NLP lab) reported that it was a hard problem to find the boundaries, or
rather, to say whether a dot ends a sentence or is something else, e.g. "blah,
i.e. blah", "i.b.m.", "i.p. pavlov", "3.14", "28.10.2003", etc.
O
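For what it's worth, here is a deliberately naive sketch of the kind of boundary heuristic that example is warning about; the class name, abbreviation list and rules are invented for illustration and will still misfire on cases like "i.b.m." or a date followed by a capitalized word:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NaiveSentenceSplitter {

    // Tiny, obviously incomplete abbreviation list.
    private static final Set ABBREVIATIONS = new HashSet(
        Arrays.asList(new String[] { "i.e.", "e.g.", "etc.", "dr.", "mr." }));

    // A token ends a sentence if it carries final punctuation, is not a known
    // abbreviation, does not look like a number or date (3.14, 28.10.2003),
    // and the next token starts with an upper-case letter.
    public static boolean isSentenceEnd(String token, String nextToken) {
        if (!(token.endsWith(".") || token.endsWith("!") || token.endsWith("?"))) {
            return false;
        }
        String lower = token.toLowerCase();
        if (ABBREVIATIONS.contains(lower)) {
            return false;
        }
        if (lower.matches(".*\\d\\..*")) {
            return false;
        }
        return nextToken != null && nextToken.length() > 0
            && Character.isUpperCase(nextToken.charAt(0));
    }
}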
you're describing ad-hoc solutions to a problem that have an effect, but not one that
is easily predictable. one can concoct all sorts of combinations of the query
operators that would have something of the effect that i am describing. crossing
sentence boundaries, however, can't be done without
I'm still confused what the issue here is. If you're interested in
stopping exact phrase matching from crossing sentence boundaries, that's
easy to do with setPositionIncrement(). If you want to score an exact
phrase match higher than a sloppy phrase match, and in turn score this
higher than
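For anyone following along, a rough sketch of the setPositionIncrement() idea against the 1.3-era analysis API. The class name, gap size and punctuation test are illustrative, and it assumes a tokenizer that keeps sentence-final punctuation attached to the preceding token (StandardTokenizer does not), so in practice you would pair it with your own sentence-aware tokenizer:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/** Illustrative filter: adds a large position increment to the first token
 *  after what looks like a sentence boundary, so an exact (slop 0)
 *  PhraseQuery cannot match across sentences. */
public class SentenceGapFilter extends TokenFilter {

    private static final int GAP = 100;   // anything larger than the slop you allow
    private boolean boundaryPending = false;

    public SentenceGapFilter(TokenStream in) {
        input = in;                        // 1.3-era TokenFilter has no constructor
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) {
            return null;
        }
        if (boundaryPending) {
            t.setPositionIncrement(t.getPositionIncrement() + GAP);
            boundaryPending = false;
        }
        String text = t.termText();
        // crude test; assumes trailing punctuation survives tokenization
        if (text.endsWith(".") || text.endsWith("!") || text.endsWith("?")) {
            boundaryPending = true;
        }
        return t;
    }
}

With the gap in place, an exact PhraseQuery can no longer match a phrase that straddles two sentences, while sloppy matches within a sentence are unaffected.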
taking advantage of interterm correlations by recognizing linguistic rules is an easy
thing to do compared to all the other schemes i know. it doesn't require genuine NLP,
external knowledge bases, or commonsense reasoning. it does require smarter document
parsers and an extended index to implemen
You should write your own code that creates the Document objects with
the fields you wish, with a Field.Keyword for the URL probably. Take
what is useful from IndexHTML.java, but don't use it as-is. If you're
speaking of pulling the document from a URL now you're talking of doing
some HTTP co
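For the Document side of that, a minimal sketch of what the replacement for IndexHTML's document construction might look like, assuming you already have the URL and the extracted text in hand; the class name and field names are just examples:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class UrlDocumentIndexer {

    // Builds a Document keyed by URL rather than by file path.
    public static Document makeDocument(String url, String title, String bodyText) {
        Document doc = new Document();
        doc.add(Field.Keyword("url", url));           // stored, indexed, not tokenized
        doc.add(Field.Text("title", title));          // stored, indexed, tokenized
        doc.add(Field.Text("contents", new StringReader(bodyText))); // indexed only, not stored
        return doc;
    }

    public static void index(String indexDir, String url, String title, String body)
            throws IOException {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false); // false = append
        writer.addDocument(makeDocument(url, title, body));
        writer.close();
    }
}

Field.Keyword keeps the URL as a single untokenized term, so it can later be used to delete or replace the document for that URL.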
On Nov 14, 2003, at 21:16, Chong, Herb wrote:
if you know what TREC is, you know what i meant earlier. this isn't
exotic technology, this is close to 15-year-old technology.
This is not really what I asked. What I would be interested to know is
what approach you consider to provide the "biggest
then i wouldn't have typed capital gains tax. there is a psychology of query creation
too and that is one thing i am taking advantage of.
Herb
-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 3:15 PM
To: Lucene Users List
Subject: Re: inte
if you know what TREC is, you know what i meant earlier. this isn't exotic technology,
this is close to 15-year-old technology.
Herb
-Original Message-
From: petite_abeille [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 3:12 PM
To: Lucene Users List
Subject: Re: Vector Spa
it has its limitations. that is why i am looking at what it would take to solve some
of them. parsing documents to recognize sentences and storing sentence boundaries in
the index would solve the ones that are most limiting. superposing interterm
correlation on top of Lucene isn't very useful be
Chong, Herb wrote:
since i am working now on financial news, here is an example:
capital gains tax
if i just run this query against a million document newswire index, i know i am going
to get lots of hits. the phrase "capital gains tax" hits a lot fewer documents, but is
overrestrictive. the fact
> >> Rules of linguistics? Is there such a thing? :)
> >
> > Actually, yes there is. Natural Language Processing (NLP) is a very broad
> > research subject but a lot has come out of it.
>
> A lot of what? "If" statements? :)
Yes... just like all software boils down to branching and while loo
On Nov 14, 2003, at 20:54, Chong, Herb wrote:
it solves one part of the problem, but there are a lot of sentences in
a typical document. you'll need to composite a rank of a document from
its constituent sentences then. there are less drastic ways to solve
the problem. the other problem is that
I'm using the Lucene demo IndexHTML.java and pdfbox to index PDF files under a
file directory. I have a URL pointing to the file directory. I'd like to
index the files using the URL instead of the file directory. Any idea how
to make IndexHTML take a URL for indexing?
Thanks,
Oliver
On Friday, November 14, 2003, at 02:54 PM, Chong, Herb wrote:
it solves one part of the problem, but there are a lot of sentences in
a typical document. you'll need to composite a rank of a document from
its constituent sentences then. there are less drastic ways to solve
the problem. the other
it solves one part of the problem, but there are a lot of sentences in a typical
document. you'll need to composite a rank of a document from its constituent sentences
then. there are less drastic ways to solve the problem. the other problem is that
Lucene doesn't consider the term order in the
On Friday, November 14, 2003, at 02:32 PM, Chong, Herb wrote:
when people type in multiword queries, mostly they are interested in
phrases in the linguistic sense. phrases don't cross sentence
boundaries. you need certain features in the index and in the ranking
algorithm to capture that distin
On Nov 14, 2003, at 20:29, Philippe Laflamme wrote:
Rules of linguistics? Is there such a thing? :)
Actually, yes there is. Natural Language Processing (NLP) is a very broad
research subject but a lot has come out of it.
A lot of what? "If" statements? :)
More specifically, Rule-based taggers ha
when people type in multiword queries, mostly they are interested in phrases in the
linguistic sense. phrases don't cross sentence boundaries. you need certain features
in the index and in the ranking algorithm to capture that distinction and rank
documents truly having that phrase higher than d
On Nov 14, 2003, at 20:27, Dror Matalon wrote:
I might be the only person on the list who's having a hard time
following this discussion.
Nope. I don't understand a word of what those guys are talking about
either :)
Would one of you wise folks care to point me
to a good "dummies", also known a
> Rules of linguistics? Is there such a thing? :)
Actually, yes there is. Natural Language Processing (NLP) is a very broad
research subject but a lot has come out of it.
More specifically, rule-based taggers have become very popular since Eric
Brill published his work on trainable rule-based ta
something in the index needs to mark sentence boundaries so that words close together
in the query and found close together in the text get penalized for being separated by
a boundary. proximity isn't enough because in more complex queries, some of the
intermediate words might be omitted. in oth
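One way to read "something in the index needs to mark sentence boundaries" is to have the document parser inject an artificial marker term between sentences before analysis. A minimal sketch using java.text.BreakIterator, whose sentence rules are crude compared to real NLP segmentation; the class name and the "_sb_" marker string are invented here:

import java.text.BreakIterator;

public class SentenceMarkerInserter {

    /** Illustrative: rewrites raw text so that an artificial "_sb_" term sits
     *  between sentences; the analyzer then indexes the marker like any word. */
    public static String markBoundaries(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance();
        it.setText(text);
        StringBuffer out = new StringBuffer();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.append(text.substring(start, end));
            out.append(" _sb_ ");   // marker term; must never occur in real text
        }
        return out.toString();
    }
}

Ranking code could then, for example, penalize phrase or proximity matches whose span contains the marker term.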
Hi,
I might be the only person on the list who's having a hard time
following this discussion. Would one of you wise folks care to point me
to a good "dummies", also known as an executive summary, resource about
the theoretical background of all of this. I understand the basic
premise of collectin
On Friday, November 14, 2003, at 02:02 PM, Chong, Herb wrote:
if i just run this query against a million document newswire index, i
know i am going to get lots of hits. the phrase "capital gains tax"
hits a lot fewer documents, but is overrestrictive. the fact that the
three terms occur next to
On Nov 14, 2003, at 19:50, Chong, Herb wrote:
if you are handling inter correlation properly, then terms can't cross
sentence boundaries.
Could you not break down your document along sentence boundaries? If you
manage to figure out what a sentence is, that is.
if you are not paying attention to
since i am working now on financial news, here is an example:
capital gains tax
if i just run this query against a million document newswire index, i know i am going
to get lots of hits. the phrase "capital gains tax" hits a lot fewer documents, but is
overrestrictive. the fact that the three t
On Friday, November 14, 2003, at 01:13 PM, Chong, Herb wrote:
if you didn't have to change the index then you haven't got all the
factors needed to do it well. terms can't cross sentence boundaries
and the index doesn't store sentence boundaries.
You mean if you have text like this: "Hello Herb.
if you are handling inter correlation properly, then terms can't cross sentence
boundaries. if you are not paying attention to sentence boundaries, then you are not
following rules of linguistics.
Herb...
-Original Message-
From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]
Sent: Friday
Not sure what you mean by "terms can't cross sentence boundaries". If
you're only using single-word terms, that's trivially true. What is it
that you're trying to achieve, exactly? (Your comment makes it sound
as though you simultaneously want and don't want sentence boundaries,
so I'm confu
if you didn't have to change the index then you haven't got all the factors needed to
do it well. terms can't cross sentence boundaries and the index doesn't store sentence
boundaries.
Herb...
-Original Message-
From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]
Sent: Friday, November 1
Incorporating inter-term correlation into Lucene isn't that hard; I've
done it. Nor is it incompatible with the vector-space model. I'm not
happy with the specific correlation metric that I picked, which is why
I'm not eager to generally release the code I wrote, but I think that
the basic me
i don't know of any open source search engine that incorporates interterm correlation.
i have been looking into how to do this in Lucene and so far, it's not been promising.
the indexing engine and file format need to be changed. there are very few search
engines that incorporate interterm corr
Chong, Herb wrote:
like all vector space models i have come across, Lucene ignores interterm correlation.
Herb
Hmm... Are you perhaps familiar with some open system which doesn't? I'm
curious because one of my projects (already using Lucene) could benefit
from such a feature. Right now I'm us
Erik is referring to the VERY latest version - the CVS :)
Otis
--- Erik Hatcher <[EMAIL PROTECTED]> wrote:
> On Thursday, November 13, 2003, at 06:46 PM, Tomcat Programmer
> wrote:
> > Hopefully the dev group will consider refactoring the
> > code so that when it's doing the lexing it will throw
like all vector space models i have come across, Lucene ignores interterm correlation.
Herb
to me, vector space implies thinking inside the box.
Herb...
there are at least two or three others including pure probabilistic ones. most of the
ones i designed were probabilistic ones that are not vector space but more general.
vector space is a special case of a much more general probabilistic framework for document
representation.
Herb
-Origina
The model implies the quality, thus it does matter.
ad "several important models") Are any of them implemented in Lucene?
Chong, Herb wrote:
does it matter? vector space is only one of several important ones.
Herb
-Original Message-
From: Leo Galambos [mailto:[EMAIL PROTECTED]
Sent
Hi,
>>
vector space is only one of several important ones.
>>
what are these several other important ones?
While Lucene does not give an explicit vector space
representation - you cannot efficiently access the vector
of one document - the index's basic representation is
a reduction of each do
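Hedging on the exact constants in Lucene's Similarity, that reduction can be written roughly as

score(q,d) \propto \sum_{t \in q} w_{t,q}\, w_{t,d},
\qquad
w_{t,d} \approx \mathrm{tf}(t,d)\cdot \mathrm{idf}(t)\cdot \mathrm{norm}(d),
\qquad
w_{t,q} \approx \mathrm{idf}(t)\cdot \mathrm{boost}(t)

i.e. a weighted dot product over the query terms. The classic vector space model is this same sum read as a cosine between tf-idf vectors, and probabilistic weightings (BM25-style, for example) plug different w's into the same sum, which is why the "which model" question largely comes down to the choice of weights.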
does it matter? vector space is only one of several important ones.
Herb
-Original Message-
From: Leo Galambos [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 4:00 AM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?
Really? And what model is used/implemente
On Thursday, November 13, 2003, at 04:32 PM, Jie Yang wrote:
Well, not quite. A user normally enters a search string
A that normally returns 1000 out of 2 million docs. I
then append A with 500 OR conditions... A AND (B OR C
OR ... OR x500). I am trying to optimise the 500 OR
terms so that it does n
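A hedged sketch of one common way to handle that with the 1.3-era API: materialize the 500 extra terms once as a Filter whose BitSet can be cached per IndexReader and reused across searches, instead of re-evaluating 500 OR clauses on every query. The class name is invented:

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

/** Illustrative filter: marks every document containing at least one of a
 *  fixed set of terms, so the "OR" terms are not scored on every query. */
public class TermSetFilter extends Filter {

    private final Term[] terms;

    public TermSetFilter(Term[] terms) {
        this.terms = terms;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (int i = 0; i < terms.length; i++) {
            TermDocs td = reader.termDocs(terms[i]);
            try {
                while (td.next()) {
                    bits.set(td.doc());
                }
            } finally {
                td.close();
            }
        }
        return bits;
    }
}

You would then call searcher.search(queryForA, filter); caching the computed BitSet per IndexReader, so bits() is not recomputed on every search, is a straightforward extension.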
On Thursday, November 13, 2003, at 07:42 PM, Joseph Wilkicki wrote:
I'm having a problem with searching dates. I created two documents with the
same date, 08/27/2002, in a lastModified field and then try to search a
range lastModified:[20020827 TO 20020827] (Other, wider ranges, don't seem
to
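One thing to check: if the field was added with Field.Keyword("lastModified", someDate), what is actually in the index is the DateField encoding of the millisecond timestamp, not the text 20020827, so a textual range like [20020827 TO 20020827] can never match. A minimal sketch of querying against what was really indexed; the class and parameter names are illustrative:

import java.util.Date;
import org.apache.lucene.document.DateField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.RangeQuery;

public class LastModifiedRange {

    /** Builds a range query over a field that was indexed with
     *  Field.Keyword(name, someDate), i.e. stored in DateField encoding. */
    public static RangeQuery forDay(String field, Date dayStart, Date nextDayStart) {
        Term lower = new Term(field, DateField.dateToString(dayStart));
        Term upper = new Term(field, DateField.dateToString(nextDayStart));
        return new RangeQuery(lower, upper, false); // exclusive upper bound
    }
}

The alternative is to index the date as a plain yyyyMMdd string in the first place, in which case the range syntax from the original mail works as typed.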
On Thursday, November 13, 2003, at 06:46 PM, Tomcat Programmer wrote:
Hopefully the dev group will consider refactoring the
code so that when it's doing the lexing it will throw
TokenMgrExceptions instead of TokenMgrErrors.
Throwing Errors should be reserved for only the
nastiest of conditions.
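Until that happens, the usual defensive pattern is to catch the Error explicitly around the parse call. A minimal sketch against the 1.3-era static QueryParser.parse; the default field name and the choice to rewrap as a ParseException are just one option:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.TokenMgrError;
import org.apache.lucene.search.Query;

public class SafeQueryParser {

    /** Parses user input, turning lexer Errors into ordinary ParseExceptions. */
    public static Query parse(String userInput) throws ParseException {
        try {
            return QueryParser.parse(userInput, "contents", new StandardAnalyzer());
        } catch (TokenMgrError e) {
            throw new ParseException("Could not lex query: " + e.getMessage());
        }
    }
}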
Really? And what model is used/implemented by Lucene?
THX
Leo
Otis Gospodnetic wrote:
Lucene does not implement the vector space model.
Otis
--- [EMAIL PROTECTED] wrote:
Hi,
does Lucene implement a Vector Space Model? If yes, does anybody have an
example of how to use it?
Cheers,
Ralf