Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Stefan Groschupf
Herb, On Friday 14 November 2003 13:39, Chong, Herb wrote: you're describing ad-hoc solutions to a problem that have an effect, but not one that is easily predictable. one can concoct all sorts of combinations of the query operators that would have something of the effect that i am describing.

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Tatu Saloranta
On Friday 14 November 2003 13:39, Chong, Herb wrote: > you're describing ad-hoc solutions to a problem that have an effect, but > not one that is easily predictable. one can concoct all sorts of > combinations of the query operators that would have something of the effect > that i am describing. cr

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Tatu Saloranta
On Friday 14 November 2003 11:50, Chong, Herb wrote: > if you are handling inter correlation properly, then terms can't cross > sentence boundaries. if you are not paying attention to sentence > boundaries, then you are not following rules of linguistics. Isn't that quite a strict interpretation, ho

Slow response time with datefilter

2003-11-14 Thread Dror Matalon
Hi, We're seeing slow response time when we apply datefilter. A search that takes 7 msec with no datefilter takes 368 msec when I filter on the last fifteen days, and 632 msec on the last 30 days. Initially we saved doing document.add(Field.Keyword("dtstamp", dtstamp)); and then change to doin

Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-14 Thread Otis Gospodnetic
Hello Herb, I don't approve of several teasing, mean, etc. emails I saw from a few people. This is a serious and polite email. :) It sounds like you know about NLP and see places where Lucene could be improved. Lucene is open source and free, and could benefit from knowledgeable people like you

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Andrzej Bialecki
Well ... Sure, nothing can replace a human mind. But believe it or not, there are studies which show that even human experts can significantly differ in their opinions on what are key-phrases for a given text. So, the results are never clear cut with humans either... So, in this sense a heurist

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Stefan Groschupf
PA, But Lucene is a low level indexing library. I'm sure most people here will agree that Lucene is much more than a _low level_ indexing library. Maybe it is just a library, but definitely the *highest level* search technology available on the web for free. You ride roughshod over the hard

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 21:14, Philippe Laflamme wrote: Rules of linguistics? Is there such a thing? :) Actually, yes there is. Natural Language Processing (NLP) is a very broad research subject but a lot has come out of it. A lot of what? "If" statements? :) Yes... just like every software boils down

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Doug Cutting
Leo Galambos wrote: There are other (more trivial) problems as well. One geek from UFAL (our NLP lab) reported, that it was a hard problem to find the boundaries, or rather, to say whether a dot is a dot or something else, i.e. "blah, i.e. blah" "i.b.m." "i.p. pavlov" "3.14" "28.10.2003" etc. O

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Chong, Herb
you're describing ad-hoc solutions to a problem that have an effect, but not one that is easily predictable. one can concoct all sorts of combinations of the query operators that would have something of the effect that i am describing. crossing sentence boundaries, however, can't be done without

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Doug Cutting
I'm still confused what the issue here is. If you're interested in stopping exact phrase matching from crossing sentence boundaries, that's easy to do with setPositionIncrement(). If you want to score an exact phrase match higher than a sloppy phrase match, and in turn score this higher than
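The position-increment idea mentioned above can be illustrated with a toy model. This is only a sketch that does not call Lucene at all: the class PositionGapSketch, the Posting type, the SENTENCE_GAP value and the naive splitting on '.', '!' and '?' are all assumptions made for illustration. It shows why an exact phrase, which needs consecutive positions, stops matching once the first token of each new sentence is given a larger position increment.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of token positions. Hypothetical names; this is not Lucene code. */
final class PositionGapSketch {
    /** Assumed extra increment applied to the first token of every sentence after the first. */
    static final int SENTENCE_GAP = 100;

    /** A term together with the position it would be indexed at. */
    static final class Posting {
        final String term;
        final int position;

        Posting(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /** Splits naively on '.', '!' and '?' and assigns a position to every token. */
    static List<Posting> index(String text) {
        List<Posting> postings = new ArrayList<Posting>();
        int position = -1;
        for (String sentence : text.split("[.!?]")) {
            boolean firstInSentence = true;
            for (String token : sentence.trim().toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                // Increment is 1 inside a sentence and 1 + SENTENCE_GAP at a sentence start.
                int increment = (firstInSentence && position >= 0) ? 1 + SENTENCE_GAP : 1;
                position += increment;
                postings.add(new Posting(token, position));
                firstInSentence = false;
            }
        }
        return postings;
    }

    /** True if the phrase terms occur at consecutive positions (an exact phrase match). */
    static boolean matchesExactPhrase(List<Posting> postings, String... phrase) {
        for (int i = 0; i + phrase.length <= postings.size(); i++) {
            boolean match = true;
            for (int j = 0; j < phrase.length && match; j++) {
                Posting p = postings.get(i + j);
                match = p.term.equals(phrase[j])
                        && p.position == postings.get(i).position + j;
            }
            if (match) {
                return true;
            }
        }
        return false;
    }
}
```

With this model, "capital gains tax" matches inside one sentence but not when "tax" starts the next sentence, because the gap breaks the run of consecutive positions. How large the gap should be relative to any phrase slop is a design choice the sketch does not settle.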

RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
taking advantage of interterm correlations recognizing linguistic rules is easy to do compared to all the other schemes i know. it doesn't require genuine NLP, external knowledge bases, and commonsense reasoning. it does require smarter document parsers and an extended index to implemen

Re: Index using URL

2003-11-14 Thread Erik Hatcher
You should write your own code that creates the Document objects with the fields you wish, with a Field.Keyword for the URL probably. Take what is useful from IndexHTML.java, but don't use it as-is. If you're speaking of pulling the document from a URL now you're talking of doing some HTTP co

Re: Vector Space Model in Lucene?

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 21:16, Chong, Herb wrote: if you know what TREC is, you know what i meant earlier. this isn't exotic technology, this is close to 15 year old technology. This is not really what I asked. What I would be interested to know is what approach you consider to provide the "biggest

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Chong, Herb
then i wouldn't have typed capital gains tax. there is psychology of query creation too and that is one thing i am taking advantage of. Herb -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Friday, November 14, 2003 3:15 PM To: Lucene Users List Subject: Re: inte

RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
if you know what TREC is, you know what i meant earlier. this isn't exotic technology, this is close to 15 year old technology. Herb -Original Message- From: petite_abeille [mailto:[EMAIL PROTECTED] Sent: Friday, November 14, 2003 3:12 PM To: Lucene Users List Subject: Re: Vector Spa

RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
it has its limitations. that is why i am looking at what it would take to solve some of them. parsing documents to recognize sentences and storing sentence boundaries in the index would solve the ones that are most limiting. superposing interterm correlation on top of Lucene isn't very useful be

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Doug Cutting
Chong, Herb wrote: since i am working now on financial news, here is an example: capital gains tax if i just run this query against a million document newswire index, i know i am going to get lots of hits. the phrase "capital gains tax" hits a lot fewer documents, but is overrestrictive. the fact

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Philippe Laflamme
> >> Rules of linguistics? Is there such a thing? :) > > > > Actually, yes there is. Natural Language Processing (NLP) is a very > > broad > > research subject but a lot has come out of it. > > A lot of what? "If" statements? :) Yes... just like every software boils down to branching and while loo

Re: Vector Space Model in Lucene?

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 20:54, Chong, Herb wrote: it solves one part of the problem, but there are a lot of sentences in a typical document. you'll need to composite a rank of a document from its constituent sentences then. there are less drastic ways to solve the problem. the other problem is that

Index using URL

2003-11-14 Thread Zhou, Oliver
I'm using the Lucene demo IndexHTML.java and pdfbox to index PDF files under a file directory. I have a URL pointing to the file directory. I'd like to index the files using the URL instead of the file directory. Any idea how to make IndexHTML take a URL for indexing? Thanks, Oliver

Re: Vector Space Model in Lucene?

2003-11-14 Thread Erik Hatcher
On Friday, November 14, 2003, at 02:54 PM, Chong, Herb wrote: it solves one part of the problem, but there are a lot of sentences in a typical document. you'll need to composite a rank of a document from its constituent sentences then. there are less drastic ways to solve the problem. the other

RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
it solves one part of the problem, but there are a lot of sentences in a typical document. you'll need to composite a rank of a document from its constituent sentences then. there are less drastic ways to solve the problem. the other problem is that Lucene doesn't consider the term order in the

Re: Vector Space Model in Lucene?

2003-11-14 Thread Erik Hatcher
On Friday, November 14, 2003, at 02:32 PM, Chong, Herb wrote: when people type in multiword queries, mostly they are interested in phrases in the linguistic sense. phrases don't cross sentence boundaries. you need certain features in the index and in the ranking algorithm to capture that distin

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 20:29, Philippe Laflamme wrote: Rules of linguistics? Is there such a thing? :) Actually, yes there is. Natural Language Processing (NLP) is a very broad research subject but a lot has come out of it. A lot of what? "If" statements? :) More specifically, Rule-based taggers ha

RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
when people type in multiword queries, mostly they are interested in phrases in the linguistic sense. phrases don't cross sentence boundaries. you need certain features in the index and in the ranking algorithm to capture that distinction and rank documents truly having that phrase higher than d

Re: Vector Space Model in Lucene?

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 20:27, Dror Matalon wrote: I might be the only person on the list who's having a hard time following this discussion. Nope. I don't understand a word of what those guys are talking about either :) Would one of you wise folks care to point me to a good "dummies", also known a

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Philippe Laflamme
> Rules of linguistics? Is there such a thing? :) Actually, yes there is. Natural Language Processing (NLP) is a very broad research subject but a lot has come out of it. More specifically, Rule-based taggers have become very popular since Eric Brill published his works on trainable rule-based ta

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Chong, Herb
something in the index needs to mark sentence boundaries so that words close together in the query and found close together in the text get penalized for being separated by a boundary. proximity isn't enough because in more complex queries, some of the intermediate words might be omitted. in oth
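One way to read the penalty described above is as a proximity score that shrinks whenever two matched terms are separated by a sentence boundary. The following is a minimal sketch under stated assumptions, not the poster's actual scheme: the class name, the 1/distance proximity measure and the BOUNDARY_PENALTY factor are all hypothetical, and the sentence index of each occurrence is assumed to be available from the index.

```java
/** Hypothetical pairwise proximity score with a sentence-boundary penalty. */
final class BoundaryPenaltySketch {
    /** Assumed multiplicative penalty applied once per sentence boundary crossed. */
    static final double BOUNDARY_PENALTY = 0.25;

    /**
     * Scores one pair of term occurrences.
     *
     * @param posA      word offset of the first occurrence
     * @param sentenceA index of the sentence containing the first occurrence
     * @param posB      word offset of the second occurrence
     * @param sentenceB index of the sentence containing the second occurrence
     */
    static double pairScore(int posA, int sentenceA, int posB, int sentenceB) {
        int distance = Math.abs(posA - posB);
        if (distance == 0) {
            return 0.0; // same occurrence, nothing to correlate
        }
        double proximity = 1.0 / distance; // closer terms score higher
        int boundariesCrossed = Math.abs(sentenceA - sentenceB);
        return proximity * Math.pow(BOUNDARY_PENALTY, boundariesCrossed);
    }
}
```

Under these assumptions two adjacent terms in the same sentence score higher than the same two terms at the same distance with a boundary between them, which is the distinction plain proximity cannot make.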

Re: Vector Space Model in Lucene?

2003-11-14 Thread Dror Matalon
Hi, I might be the only person on the list who's having a hard time following this discussion. Would one of you wise folks care to point me to a good "dummies", also known as an executive summary, resource about the theoretical background of all of this. I understand the basic premise of collectin

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Erik Hatcher
On Friday, November 14, 2003, at 02:02 PM, Chong, Herb wrote: if i just run this query against a million document newswire index, i know i am going to get lots of hits. the phrase "capital gains tax" hits a lot fewer documents, but is overrestrictive. the fact that the three terms occur next to

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 19:50, Chong, Herb wrote: if you are handling inter correlation properly, then terms can't cross sentence boundaries. Could you not break down your document along sentence boundaries? If you manage to figure out what a sentence is, that is. if you are not paying attention to

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Chong, Herb
since i am working now on financial news, here is an example: capital gains tax if i just run this query against a million document newswire index, i know i am going to get lots of hits. the phrase "capital gains tax" hits a lot fewer documents, but is overrestrictive. the fact that the three t

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Erik Hatcher
On Friday, November 14, 2003, at 01:13 PM, Chong, Herb wrote: if you didn't have to change the index then you haven't got all the factors needed to do it well. terms can't cross sentence boundaries and the index doesn't store sentence boundaries. You mean if you have text like this: "Hello Herb.

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Chong, Herb
if you are handling inter correlation properly, then terms can't cross sentence boundaries. if you are not paying attention to sentence boundaries, then you are not following rules of linguistics. Herb... -Original Message- From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED] Sent: Friday

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Joshua O'Madadhain
Not sure what you mean by "terms can't cross sentence boundaries". If you're only using single-word terms, that's trivially true. What is it that you're trying to achieve, exactly? (Your comment makes it sound as though you simultaneously want and don't want sentence boundaries, so I'm confu

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Chong, Herb
if you didn't have to change the index then you haven't got all the factors needed to do it well. terms can't cross sentence boundaries and the index doesn't store sentence boundaries. Herb... -Original Message- From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED] Sent: Friday, November 1

inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Joshua O'Madadhain
Incorporating inter-term correlation into Lucene isn't that hard; I've done it. Nor is it incompatible with the vector-space model. I'm not happy with the specific correlation metric that I picked, which is why I'm not eager to generally release the code I wrote, but I think that the basic me

RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
i don't know of any open source search engine that incorporates interterm correlation. i have been looking into how to do this in Lucene and so far, it's not been promising. the indexing engine and file format need to be changed. there are very few search engines that incorporate interterm corr

Re: Vector Space Model in Lucene?

2003-11-14 Thread Andrzej Bialecki
Chong, Herb wrote: like all vector space models i have come across, Lucene ignores interterm correlation. Herb Hmm... Are you perhaps familiar with some open system which doesn't? I'm curious because one of my projects (already using Lucene) could benefit from such a feature. Right now I'm us

Re: QueryParser Rules article (Erik Hatcher)

2003-11-14 Thread Otis Gospodnetic
Erik is referring to the VERY latest version - the CVS :) Otis --- Erik Hatcher <[EMAIL PROTECTED]> wrote: > On Thursday, November 13, 2003, at 06:46 PM, Tomcat Programmer > wrote: > > Hopefully the dev group will consider refactoring the > > code so that when it's doing the lexing it will throw

RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
like all vector space models i have come across, Lucene ignores interterm correlation. Herb - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
to me, vector space implies thinking inside the box. Herb...

RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
there are at least two or three others including pure probabilistic ones. most of the ones i designed were probabilistic ones that are not vector space but more general. vector space is a special case of a much more general probabilistic framework document representation. Herb -Origina

Re: Vector Space Model in Lucene?

2003-11-14 Thread Leo Galambos
The model implies the quality, thus it does matter. ad "several important models") Are any of them implemented in Lucene? Chong, Herb wrote: does it matter? vector space is only one of several important ones. Herb -Original Message- From: Leo Galambos [mailto:[EMAIL PROTECTED] Sent

AW: Vector Space Model in Lucene?

2003-11-14 Thread Karsten Konrad
Hi, >> vector space is only one of several important ones. >> what are these several other important ones? While Lucene does not give an explicit vector space representation - you cannot efficiently access the vector of one document - the index's basic representation is a reduction of each do

RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
does it matter? vector space is only one of several important ones. Herb -Original Message- From: Leo Galambos [mailto:[EMAIL PROTECTED] Sent: Friday, November 14, 2003 4:00 AM To: Lucene Users List Subject: Re: Vector Space Model in Lucene? Really? And what model is used/implemente

Re: Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-14 Thread Erik Hatcher
On Thursday, November 13, 2003, at 04:32 PM, Jie Yang wrote: Well, not quite, User normally enters a search string A that normally returns 1000 out of 2 million docs. I then append A with 500 OR conditions... A AND (B or C or ... or x500). I am trying to optimise the 500 OR terms so that it does n
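The shape of the query A AND (B OR C OR ...) suggests evaluating A first and using its result as a filter for the OR terms. The sketch below illustrates that idea with plain java.util.BitSet over a hypothetical in-memory postings map; the class name, the map layout and the maxDoc parameter are assumptions, it does not use any Lucene API, and it makes no claim about how Lucene would actually execute or speed up such a query.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Map;

/** Hypothetical filter-first evaluation of: required AND (optional1 OR optional2 OR ...). */
final class FilterFirstSketch {

    /** Returns the documents that contain the required term and at least one optional term. */
    static BitSet search(Map<String, int[]> postings, String required,
                         List<String> optional, int maxDoc) {
        BitSet requiredDocs = toBitSet(postings.get(required), maxDoc);
        BitSet hits = new BitSet(maxDoc);
        for (String term : optional) {
            int[] docs = postings.get(term);
            if (docs == null) {
                continue; // term not in the index
            }
            for (int doc : docs) {
                // Only documents that already passed the filter can become hits.
                if (requiredDocs.get(doc)) {
                    hits.set(doc);
                }
            }
        }
        return hits;
    }

    /** Converts a list of document numbers into a bit set; null means "no documents". */
    static BitSet toBitSet(int[] docs, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        if (docs != null) {
            for (int doc : docs) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```

The bit set for A is computed once and reused for every OR term, so the per-term work is a membership test rather than a full boolean clause.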

Re: Problem with Date Range search

2003-11-14 Thread Erik Hatcher
On Thursday, November 13, 2003, at 07:42 PM, Joseph Wilkicki wrote: I'm having a problem with searching dates. I created two documents with the same date, 08/27/2002, in a lastModified field and then try and search a range lastModified:[20020827 TO 20020827] (Other, wider ranges, don't seem to
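The message above mentions a date written as 08/27/2002 and a range written as [20020827 TO 20020827]. Assuming the range is evaluated by comparing term text lexicographically (an assumption, and not necessarily the cause of the poster's problem), the indexed value has to use the same sortable format as the range bounds. The sketch below uses a hypothetical inRange helper on plain strings to show the difference.

```java
/** Hypothetical inclusive range check on term text, compared lexicographically. */
final class DateRangeSketch {

    /** True if lower <= value <= upper in String.compareTo order. */
    static boolean inRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }
}
```

Under that assumption, inRange("20020827", "20020827", "20020827") holds because the bounds are inclusive, while a value stored as "08/27/2002" falls outside the same range because its text sorts before "20020827".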

Re: QueryParser Rules article (Erik Hatcher)

2003-11-14 Thread Erik Hatcher
On Thursday, November 13, 2003, at 06:46 PM, Tomcat Programmer wrote: Hopefully the dev group will consider refactoring the code so that when it's doing the lexing it will throw TokenMgrExceptions instead of TokenMgrErrors. Throwing Errors should be reserved for only the most nasty of conditions.

Re: Vector Space Model in Lucene?

2003-11-14 Thread Leo Galambos
Really? And what model is used/implemented by Lucene? THX Leo Otis Gospodnetic wrote: Lucene does not implement vector space model. Otis --- [EMAIL PROTECTED] wrote: Hi, does Lucene implement a Vector Space Model? If yes, does anybody have an example of how to use it? Cheers, Ralf