Re: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])
On Monday 17 November 2003 08:39, Chong, Herb wrote: > the core of the search engine has to have certain capabilities, however, > because they are next to impossible to add as a layer on top with any > efficiency. detecting sentence boundaries outside the core search engine is > really hard to do without building another search engine index. if i have > to do that, there is no point in using Lucene. It's also good to know what exactly constitutes the core; I would assume that analyzer implementations are not part of the core per se, as long as the core knows how to use analyzers. But as long as the index structure has some way to store the information needed (perhaps by using the existing property of distances between tokens, which allows both overlapping tokens and gaps, like someone suggested?), the core need not know the specifics of how analyzers determine structural (sentence etc.) boundaries. To me this seems like one of many issues where it's possible to retain the distinction between the Lucene kernel (lean mean core) and more specialized functionality; highlighting was another one. -+ Tatu +- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
On Monday 17 November 2003 07:40, Chong, Herb wrote: > i don't know what the Java implementation is like but the C++ one is very > fast. ... >> I personally do not have any experience with the BreakIterator in Java. Has >> anyone used it in any production environment? I'd be very interested to >> learn more about its efficiency. Even if that implementation weren't fast (which it should be), it should be fairly easy to implement it to be pretty much as efficient as any of the basic tokenizers; i.e. not much slower than the full scanning speed over the text data plus token-creation overhead. -+ Tatu +- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
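A minimal sketch of the java.text.BreakIterator usage being discussed (the class name SentenceSplitter and the sample text are illustrative, not from the thread):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {

    // Split text into sentences using the JDK's built-in boundary analysis.
    public static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<String>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (sentence.length() > 0) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("It is fast. Is it accurate? Mostly!")) {
            System.out.println(s);
        }
    }
}
```

Since the iterator only returns boundary offsets over a single pass of the text, the per-token cost is close to the plain scanning overhead Tatu describes.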
Re: AW: Slow response time with datefilter
Hi, So we've implemented both suggestions and it made a big difference. You can see a Beta sample at http://www.fastbuzz.com/search/index.jsp We have around 7,000,000 items in the index. What we did: 1. Instead of using msec granularity, we're using hour granularity for date searches. This reduced search times from tens of seconds to 2-5 seconds. Not ideal, but ... 2. We cache the results. So if you're looking for items in the last 15 days and then do a "next" it'll save the filter using CachingWrapperFilter and reuse it, resulting in much faster times the second time. This reduces the times from the above 2-5 seconds to 0.2 - 0.8 msecs. One of the challenges though is that since the index is updated in real time, we can't cache for very long. We'll probably have to set up a mechanism to "seed" the cache before the "new" index becomes available. Regards, Dror On Sat, Nov 15, 2003 at 11:03:13AM -0800, Dror Matalon wrote: > After posting the original email, I started wondering if that's the > issue, the fact that we store timestamp up to the millisecond rather > than a more reasonable granularity. Dates are too coarse a granularity for > us, but minutes, and possibly hours should work. > > I'll report once we've tested some more. > > Regards, > > Dror > > On Sat, Nov 15, 2003 at 12:25:47PM -0500, Erik Hatcher wrote: > > On Saturday, November 15, 2003, at 11:38 AM, Karsten Konrad wrote: > > >If the number of different date terms causes this effect, why not > > >"round" > > >the date to the nearest or next midnight while indexing. Thus, > > >filtering > > >for the last 15 days would require walking over 15-17 different date > > >terms. > > >If you don't do this, the number of different terms will be the same as > > >the number of documents you indexed, explaining the slowing down when > > >you > > >have more results. > > > > I wholeheartedly concur. And in fact I don't use the Keyword(String, > > Date) thing at all if I just need to represent a date. 
I use YYYYMMDD > > as a String instead. It's just too fiddly to deal with dates using the > > built-in handling of it. > > > > Erik > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > -- > Dror Matalon > Zapatec Inc > 1700 MLK Way > Berkeley, CA 94709 > http://www.fastbuzz.com > http://www.zapatec.com > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > -- Dror Matalon Zapatec Inc 1700 MLK Way Berkeley, CA 94709 http://www.fastbuzz.com http://www.zapatec.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
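The hour-granularity rounding described above can be sketched in plain Java. The helper name toHourKeyword is hypothetical; the idea is simply that every timestamp in the same hour maps to the same sortable, indexable term:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class HourGranularity {

    // Round a timestamp down to the hour and render it as a sortable keyword.
    public static String toHourKeyword(long millis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHH");
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(new Date(millis));
    }

    public static void main(String[] args) {
        // Two timestamps 59 minutes apart in the same hour map to one term.
        System.out.println(toHourKeyword(0L));       // 1970010100
        System.out.println(toHourKeyword(3540000L)); // 1970010100
        System.out.println(toHourKeyword(3600000L)); // 1970010101
    }
}
```

With this scheme a 15-day range filter walks at most 15 × 24 = 360 distinct terms instead of one term per document, which is the effect Dror and Karsten are describing.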
Re: Which operations change document ids?
If they're optimized at different times then the document ids could get out of sync, as the optimized version will have deleted documents removed, while the un-optimized one won't. Also, for add/delete to keep document ids in sync you need to also be sure to use the same mergeFactor. Doug Jamie Stallwood wrote: If you create two parallel indices (to use different parsing methods for instance), and always add and delete documents in parallel, will the document ID's always correspond in both indices? And could optimization destroy any such invariance? -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: 17 November 2003 19:51 To: Lucene Users List Subject: Re: Which operations change document ids? Tate Avery wrote: My first question is: should I steer clear of this all together? No, I think this is appropriate. If not, I need to know which Lucene operations can cause document ids to change. I am assuming that the following can cause potential changes: 1) Add document 2) Optimize index What else could cause a document id to change? Nothing. And even these can only cause an id to change if there have been deletions. Could delete provoke a doc id change? Not when you perform the delete. Later, when you add to or optimize the index, the ids for deleted documents are reclaimed. And, I am assuming that the following DO NOT change the document id: 1) Query the index That is correct. Document ids never change with an instance of IndexReader. When you open a new index reader you should usually assume that ids have changed. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Which operations change document ids?
If you create two parallel indices (to use different parsing methods for instance), and always add and delete documents in parallel, will the document ID's always correspond in both indices? And could optimization destroy any such invariance? -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: 17 November 2003 19:51 To: Lucene Users List Subject: Re: Which operations change document ids? Tate Avery wrote: > My first question is: should I steer clear of this all together? No, I think this is appropriate. > If not, I need to know which Lucene operations can cause document ids to change. > > I am assuming that the following can cause potential changes: > 1) Add document > 2) Optimize index > > What else could cause a document id to change? Nothing. And even these can only cause an id to change if there have been deletions. > Could delete provoke a doc id change? Not when you perform the delete. Later, when you add to or optimize the index, the ids for deleted documents are reclaimed. > And, I am assuming that the following DO NOT change the document id: > > 1) Query the index That is correct. Document ids never change with an instance of IndexReader. When you open a new index reader you should usually assume that ids have changed. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
javacc2.0
Dear Brian: From the list I found that you have JavaCC 2.0. Would you please send the package to me? I could not find it anywhere else. Thanks Jianshuo - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]
Ah, I see. You have an absolute interpretation; mine is more relative. I think we're talking about a heuristic, not a law. Matches within a sentence are scored higher than those that are not. And the closer together the matching terms are, whether within the same sentence or not, the greater the score. Given these two principles, at some point, as sentences get longer, a close match across sentence boundaries should probably score substantially higher than a very distant match within a sentence. Thus missing some distant yet still within-sentence matches in very long sentences probably won't substantially alter the ranking. Is 100 long enough? Perhaps not. But 1000 is certainly plenty long. Doug Chong, Herb wrote: any arbitrary number you pick will be broken by some document someone puts into the system. Herb -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 2:56 PM To: Lucene Users List Subject: Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?] This is exactly the sort of approach I was advocating in earlier messages. (Although I think you'd only need to increase the position counter by 101 for the first word in each sentence.) Herb Chong didn't seem to think this was appropriate, but I never understood why. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]
you could use the negative of the actual value. Herb -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 2:56 PM To: Lucene Users List Subject: Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?] This is exactly the sort of approach I was advocating in earlier messages. (Although I think you'd only need to increase the position counter by 101 for the first word in each sentence.) Herb Chong didn't seem to think this was appropriate, but I never understood why. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]
any arbitrary number you pick will be broken by some document someone puts into the system. Herb -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 2:56 PM To: Lucene Users List Subject: Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?] This is exactly the sort of approach I was advocating in earlier messages. (Although I think you'd only need to increase the position counter by 101 for the first word in each sentence.) Herb Chong didn't seem to think this was appropriate, but I never understood why. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]
Karsten Konrad wrote: I was wondering whether we could, while indexing, make use of this by increasing the position counter by a large number, let's say 1000, whenever we encounter a sentence separator (Note, this is not trivial; not every '.' ends a sentence etc. etc. etc.). Thus, searching for "income tax"~100 "tax gain"~100 "income tax gain"~100 income tax gain would find "income tax gain" as usual, but would boost all texts where the phrases involved appear within sentence boundaries. This is exactly the sort of approach I was advocating in earlier messages. (Although I think you'd only need to increase the position counter by 101 for the first word in each sentence.) Herb Chong didn't seem to think this was appropriate, but I never understood why. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
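A toy illustration of the position-gap idea in plain Java, without the Lucene API (the class SentenceGapIndexer and its method names are made up for the sketch): tokens get consecutive positions, each sentence end bumps the counter by a large gap, and a proximity check with slop smaller than the gap can then never match across a boundary.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceGapIndexer {

    static final int SENTENCE_GAP = 100; // positions jump by this at each sentence end

    static class Posting {
        final String term;
        final int position;
        Posting(String term, int position) { this.term = term; this.position = position; }
    }

    // Assign positions to whitespace-delimited tokens, bumping the counter at sentence ends.
    public static List<Posting> index(String text) {
        List<Posting> postings = new ArrayList<Posting>();
        int position = 0;
        for (String token : text.trim().split("\\s+")) {
            boolean endsSentence = token.endsWith(".") || token.endsWith("!") || token.endsWith("?");
            String term = token.replaceAll("[^a-zA-Z0-9]", "").toLowerCase();
            if (term.length() > 0) {
                postings.add(new Posting(term, position));
            }
            position += endsSentence ? SENTENCE_GAP + 1 : 1;
        }
        return postings;
    }

    // With slop < SENTENCE_GAP, two terms can only match inside a single sentence.
    public static boolean withinOneSentence(String text, String a, String b) {
        List<Posting> postings = index(text);
        for (Posting pa : postings) {
            for (Posting pb : postings) {
                if (pa.term.equals(a) && pb.term.equals(b)
                        && Math.abs(pb.position - pa.position) < SENTENCE_GAP) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(withinOneSentence("income tax gain.", "income", "gain"));
        System.out.println(withinOneSentence("income tax rose. The gain was small.", "tax", "gain"));
    }
}
```

This mirrors Karsten's proposal and Doug's refinement: no linguistic analysis is stored in the index, only inflated position gaps, yet a sloppy phrase query automatically respects sentence boundaries.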
Re: Which operations change document ids?
Tate Avery wrote: My first question is: should I steer clear of this altogether? No, I think this is appropriate. If not, I need to know which Lucene operations can cause document ids to change. I am assuming that the following can cause potential changes: 1) Add document 2) Optimize index What else could cause a document id to change? Nothing. And even these can only cause an id to change if there have been deletions. Could delete provoke a doc id change? Not when you perform the delete. Later, when you add to or optimize the index, the ids for deleted documents are reclaimed. And, I am assuming that the following DO NOT change the document id: 1) Query the index That is correct. Document ids never change within an instance of IndexReader. When you open a new index reader you should usually assume that ids have changed. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
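Doug's renumbering rule can be illustrated with a toy in-memory "index" (DocIdShiftDemo is invented for the sketch; real Lucene segments work differently, but the id shift after compacting deletions is the same idea):

```java
import java.util.ArrayList;
import java.util.List;

public class DocIdShiftDemo {
    private final List<String> docs = new ArrayList<String>();
    private final List<Boolean> deleted = new ArrayList<Boolean>();

    public int add(String doc) {
        docs.add(doc);
        deleted.add(Boolean.FALSE);
        return docs.size() - 1; // ids are just positions
    }

    // Marking a document deleted does not renumber anything.
    public void delete(int id) {
        deleted.set(id, Boolean.TRUE);
    }

    public String doc(int id) {
        return docs.get(id);
    }

    // "Optimize" compacts away deleted documents; every later doc id shifts down.
    public void optimize() {
        for (int i = docs.size() - 1; i >= 0; i--) {
            if (deleted.get(i).booleanValue()) {
                docs.remove(i);
                deleted.remove(i);
            }
        }
    }
}
```

Adding "a", "b", "c" yields ids 0, 1, 2; deleting id 1 leaves "c" at id 2; after optimize(), "c" sits at id 1. This is why two parallel indices stay in sync only if they see identical add/delete/optimize sequences (and the same mergeFactor, as Doug notes).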
Which operations change document ids?
Hello, I am considering using the document id in order to implement a fast 'join' during relational search. My first question is: should I steer clear of this altogether? And why? If not, I need to know which Lucene operations can cause document ids to change. I am assuming that the following can cause potential changes: 1) Add document - since it might trigger a merge 2) Optimize index - since it does trigger a merge 3) Update document - since it is a delete + add What else could cause a document id to change? Could delete provoke a doc id change? And, I am assuming that the following DO NOT change the document id: 1) Query the index Also, am I missing any others that will or will not cause a document id to change? Thank you, Tate P.S. It appears (to me) that the SearchBean (in the lucene sandbox) sorting makes use of the Hits.id(int _n) method. How does it cope, if at all, with changes to the underlying document ids? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])
Dmitry once contributed a nice beefy patch that added Term Vector support to Lucene. While we never integrated the changes (for no good reason), I do recall that the patch was nice and elegant, because it allowed one to turn Term Vector support on/off at indexing time. If turned on, Lucene would collect information about terms and documents that allows building of term vectors. If turned off, Lucene created only its normal index files. If you can provide something like that, I bet a lot of people would be interested. I have a feeling this won't get done unless you do it, though. Otis --- "Chong, Herb" <[EMAIL PROTECTED]> wrote: > the core of the search engine has to have certain capabilities, > however, because they are next to impossible to add as a layer on top > with any efficiency. detecting sentence boundaries outside the core > search engine is really hard to do without building another search > engine index. if i have to do that, there is no point in using > Lucene. > > Herb... > > -Original Message- > From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] > Sent: Monday, November 17, 2003 10:26 AM > To: Lucene Users List > Subject: Re: Contributing to Lucene (was RE: inter-term correlation > [was R e: Vector Space Model in Lucene?]) > > > Query expansion can (and I believe should) be done efficiently > outside > the core of the search engine. After all, it's a process of changing the > query according to some expansion/rewriting algorithms, but it is > still > the unchanged search engine that in the end has to answer the new > query... > > -- > Best regards, > Andrzej Bialecki > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > __ Do you Yahoo!? Protect your identity with Yahoo! Mail AddressGuard http://antispam.yahoo.com/whatsnewfree - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])
the core of the search engine has to have certain capabilities, however, because they are next to impossible to add as a layer on top with any efficiency. detecting sentence boundaries outside the core search engine is really hard to do without building another search engine index. if i have to do that, there is no point in using Lucene. Herb... -Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 10:26 AM To: Lucene Users List Subject: Re: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?]) Query expansion can (and I believe should) be done efficiently outside the core of search engine. After all, it's a process of changing the query according to some expansion/rewriting algorithms, but it is still the unchanged search engine that in the end has to answer the new query... -- Best regards, Andrzej Bialecki - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])
then sentence detection at indexing time shouldn't see them as sentences. no sentence detection is run on the query terms. Herb... -Original Message- From: Dan Quaroni [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 10:27 AM To: 'Lucene Users List' Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?]) I'm not sure I can share a sample, but the specific situation I'm thinking of is when you have data that doesn't exist within a sentence, for example the name, address, etc of a company. Some foreign companies have funky punctuation within their names and addresses. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])
Joe Paulsen wrote: Hope this isn't out of context - but Dan makes a very valid point. Besides the potential performance slowdown if NLP was always applied to a user's query - there are times that an exact term match is desired without the "query expansion" that an NLP process normally requires. Query expansion can (and I believe should) be done efficiently outside the core of the search engine. After all, it's a process of changing the query according to some expansion/rewriting algorithms, but it is still the unchanged search engine that in the end has to answer the new query... -- Best regards, Andrzej Bialecki - Software Architect, System Integration Specialist CEN/ISSS EC Workshop, ECIMF project chair EU FP6 E-Commerce Expert/Evaluator - FreeBSD developer (http://www.freebsd.org) - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])
I'm not sure I can share a sample, but the specific situation I'm thinking of is when you have data that doesn't exist within a sentence, for example the name, address, etc of a company. Some foreign companies have funky punctuation within their names and addresses. I'd have to see the results to know if the NLP would mess anything up. If all it did was weight the results, then perhaps it wouldn't, but it's also possible that it would. Basically my concern is that it would mess up the use of lucene for non-sentence based applications that might contain punctuation. On the whole I think adding the NLP to lucene is a good idea because the vast majority of the applications of lucene would benefit from it. Making it optional could be a good way to maintain the current power of lucene and perhaps also retain the speed depending on the performance of the NLP functionality and the needs of the user. -Original Message- From: Chong, Herb [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 10:01 AM To: Lucene Users List Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?]) show an example document. Herb -Original Message- From: Dan Quaroni [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 9:48 AM To: 'Lucene Users List' Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?]) My only concern with this being integrated into lucene is that it be done in a way that doesn't make its use mandatory. Lucene is powerful enough that it can be used for a lot of cases where NLP doesn't make any sense. For example, I think that sentence boundaries would severely screw up the project I recently did using lucene because there are no sentences, but there is punctuation. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])
there is nothing i said about NLP. in fact my specific statements exclude NLP. the processing i am describing covers a linguistic observation and a constraint. a sequence of terms in the query receive a higher score when it occurs inside a single sentence than when it crosses a sentence boundary. also, there are many situations where NLP processing doesn't do any query expansion and reduces the number of possible documents that a query can match, thereby speeding up search. query expansion is only one way to use NLP, and i am not even interested in NLP changes to Lucene. Herb -Original Message- From: Joe Paulsen [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 10:12 AM To: Lucene Users List Subject: Re: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?]) Hope this isn't out of context - but Dan makes a very valid point. Besides the potential performance slowdown if NLP was always applied to a users query - there are times that an exact term match is desired without the "query expansion" that an NLP process normally requires. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])
we're talking 2-3% if this is done right. Herb... -Original Message- From: Joe Paulsen [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 10:12 AM To: Lucene Users List Subject: Re: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?]) Hope this isn't out of context - but Dan makes a very valid point. Besides the potential performance slowdown if NLP was always applied to a users query - there are times that an exact term match is desired without the "query expansion" that an NLP process normally requires. Joe - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])
Hope this isn't out of context - but Dan makes a very valid point. Besides the potential performance slowdown if NLP was always applied to a users query - there are times that an exact term match is desired without the "query expansion" that an NLP process normally requires. Joe - Original Message - From: "Chong, Herb" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Monday, November 17, 2003 10:00 AM Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?]) show an example document. Herb -Original Message- From: Dan Quaroni [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 9:48 AM To: 'Lucene Users List' Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?]) My only concern with this being integrated into lucene is that it be done in a way that doesn't make its use mandatory. Lucene is powerful enough that it can be used for a lot of cases where NLP doesn't make any sense. For example, I think that sentence boundaries would severely screw up the project I recently did using lucene because there are no sentences, but there is punctuation. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])
show an example document. Herb -Original Message- From: Dan Quaroni [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 9:48 AM To: 'Lucene Users List' Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?]) My only concern with this being integrated into lucene is that it be done in a way that doesn't make its use mandatory. Lucene is powerful enough that it can be used for a lot of cases where NLP doesn't make any sense. For example, I think that sentence boundaries would severely screw up the project I recently did using lucene because there are no sentences, but there is punctuation. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])
My only concern with this being integrated into lucene is that it be done in a way that doesn't make its use mandatory. Lucene is powerful enough that it can be used for a lot of cases where NLP doesn't make any sense. For example, I think that sentence boundaries would severely screw up the project I recently did using lucene because there are no sentences, but there is punctuation. --- "Chong, Herb" <[EMAIL PROTECTED]> wrote: > that concept is that multiword queries > are mostly multiword terms and they can't cross sentence boundaries > according to the rules of English. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
AW: inter-term correlation [was Re: Vector Space Model in Lucene?]
Hi, it actually is quite nice and it can be used in production for such things as have been discussed lately in this group. If you want to play it safe: the iterator breaks at dots after numbers (e.g. "15. March"), so the precision of the algorithm can be increased if you never break after a number. The implementation is fast. Regards, Karsten With kind regards from Saarbrücken -- Dr.-Ing. Karsten Konrad Head of Artificial Intelligence Lab XtraMind Technologies GmbH Stuhlsatzenhausweg 3 D-66123 Saarbrücken Phone: +49 (681) 3025113 Fax: +49 (681) 3025109 [EMAIL PROTECTED] www.xtramind.com -Original Message- From: Philippe Laflamme [mailto:[EMAIL PROTECTED] Sent: Monday, 17 November 2003 15:39 To: Lucene Users List Subject: RE: inter-term correlation [was Re: Vector Space Model in Lucene?] There is already an implementation in the Java API for sentence boundary detection. The BreakIterator in the java.text package has this to say about sentence splitting: "Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses." http://java.sun.com/j2se/1.4.1/docs/api/java/text/BreakIterator.html The whole i18n Java API is based on the ICU framework from IBM: http://oss.software.ibm.com/icu/index.html It supports many languages. I personally do not have any experience with the BreakIterator in Java. Has anyone used it in any production environment? I'd be very interested to learn more about its efficiency. Regards, Phil > -Original Message- > From: Chong, Herb [mailto:[EMAIL PROTECTED] > Sent: November 17, 2003 08:53 > To: Lucene Users List > Subject: RE: inter-term correlation [was Re: Vector Space Model in > Lucene?] > > > i have a program written in Icon that does basic sentence splitting. > with about 5 heuristics and one small lookup table, i can get well > over 90% accuracy doing sentence boundary detection > on email. 
for well > edited English text, like newswires, i can manage closer to 99%. this > is all that is needed for significantly improving a search engine's > performance when the query engine respects sentence boundaries. > incidentally, the GATE Information Extraction framework cites some > references that indicate that for named entity feature extraction, > their system can exceed the ability of trained humans to detect and > classify named entities if only one person does the detection. > collaborating humans are still better, but no-one has the time in > practical applications. > > you probably know, since you know about Markov chains, that within > sentence term correlation, and hence the language model, is different > than across sentences. linguists have known this for a very long time. > it isn't hard to put this capability into a search engine, but it > absolutely breaks down unless there is sentence boundary information > stored for use at query time. > > Herb > > -Original Message- > From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] > Sent: Friday, November 14, 2003 5:54 PM > To: Lucene Users List > Subject: Re: inter-term correlation [was Re: Vector Space Model in > Lucene?] > > > Well ... Sure, nothing can replace a human mind. But believe it or > not, there are studies which show that even human experts can > significantly differ in their opinions on what are key-phrases for a > given text. So, the results are never clear cut with humans either... > > So, in this sense a heuristic tool for sentence splitting and > key-phrase detection can go long ways. For example, the application I > mentioned, uses quite a few heuristic rules (+ Markov chains as a > heavier ammunition :-), and it comes up with the following phrases for > your email discussion (the text quoted below): > > (lang=EN): NLP, trainable rule-based tagging, natural language > processing, apache, NLP expert > > Now, this set of key-phrases does reflect the main noun-phrases in the > text... 
which means I have a practical and tangible benefit from NLP. > QED ;-) > > Best regards, > Andrzej > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: inter-term correlation [was Re: Vector Space Model in Lucene?]
i don't know what the Java implementation is like but the C++ one is very fast. Herb -Original Message- From: Philippe Laflamme [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 9:39 AM To: Lucene Users List Subject: RE: inter-term correlation [was Re: Vector Space Model in Lucene?] I personally do not have any experience with the BreakIterator in Java. Has anyone used it in any production environment? I'd be very interested to learn more about its efficiency. Regards, Phil - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: inter-term correlation [was Re: Vector Space Model in Lucene?]
There is already an implementation in the Java API for sentence boundary detection. The BreakIterator in the java.text package has this to say about sentence splitting: "Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses."

http://java.sun.com/j2se/1.4.1/docs/api/java/text/BreakIterator.html

The whole i18n Java API is based on the ICU framework from IBM: http://oss.software.ibm.com/icu/index.html It supports many languages.

I personally do not have any experience with the BreakIterator in Java. Has anyone used it in any production environment? I'd be very interested to learn more about its efficiency.

Regards,
Phil

> -Original Message-
> From: Chong, Herb [mailto:[EMAIL PROTECTED]
> Sent: November 17, 2003 08:53
> To: Lucene Users List
> Subject: RE: inter-term correlation [was Re: Vector Space Model in Lucene?]
>
> i have a program written in Icon that does basic sentence splitting. with
> about 5 heuristics and one small lookup table, i can get well over 90%
> accuracy doing sentence boundary detection on email. for well-edited
> English text, like newswires, i can manage closer to 99%. this is all
> that is needed for significantly improving a search engine's performance
> when the query engine respects sentence boundaries. incidentally, the
> GATE Information Extraction framework cites some references that indicate
> that for named entity feature extraction, their system can exceed the
> ability of trained humans to detect and classify named entities if only
> one person does the detection. collaborating humans are still better,
> but no one has the time in practical applications.
>
> you probably know, since you know about Markov chains, that
> within-sentence term correlation, and hence the language model, is
> different than across sentences. linguists have known this for a very
> long time. it isn't hard to put this capability into a search engine,
> but it absolutely breaks down unless there is sentence boundary
> information stored for use at query time.
>
> Herb
>
> -Original Message-
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: Friday, November 14, 2003 5:54 PM
> To: Lucene Users List
> Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
>
> Well ... Sure, nothing can replace a human mind. But believe it or not,
> there are studies which show that even human experts can significantly
> differ in their opinions on what the key-phrases for a given text are.
> So the results are never clear-cut with humans either...
>
> So, in this sense a heuristic tool for sentence splitting and key-phrase
> detection can go a long way. For example, the application I mentioned
> uses quite a few heuristic rules (+ Markov chains as heavier ammunition
> :-), and it comes up with the following phrases for your email discussion
> (the text quoted below):
>
> (lang=EN): NLP, trainable rule-based tagging, natural language
> processing, apache, NLP expert
>
> Now, this set of key-phrases does reflect the main noun-phrases in the
> text... which means I have a practical and tangible benefit from NLP.
> QED ;-)
>
> Best regards,
> Andrzej

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
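[Editor's note: a minimal sketch of the BreakIterator usage Phil is asking about, using only the JDK; the class and method names here are my own, not from any poster's code.]

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitDemo {
    // Split text into sentences using the JDK's built-in boundary analysis.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) out.add(s);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : sentences("Hello world. How are you? Fine.", Locale.US)) {
            System.out.println(s);
        }
    }
}
```

Wrapping this in an analyzer's tokenization step would be one way to feed sentence boundaries into indexing without touching Lucene's core.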
RE: inter-term correlation [was Re: Vector Space Model in Lucene?]
now you're talking. this is one way of doing it. you need to work out a heuristic to increment the counter enough that a misrecognized long sentence won't trigger this. however, one can argue that a sentence that contains 1000 words can't possibly be about one topic.

Herb

-Original Message-
From: Karsten Konrad [mailto:[EMAIL PROTECTED]
Sent: Saturday, November 15, 2003 7:16 AM
To: Lucene Users List
Subject: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

Anyway, Herb is right, sentence boundaries do carry a meaning, and the linguistic rule could be phrased as: "Constituents (concepts) mentioned together in one sentence have a closer relation than those that are not."

I was wondering whether we could, while indexing, make use of this by increasing the position counter by a large number, say 1000, whenever we encounter a sentence separator (note, this is not trivial; not every '.' ends a sentence, etc.). Thus, searching for

"income tax"~100 "tax gain"~100 "income tax gain"~100 income tax gain

would find "income tax gain" as usual, but would boost all texts where the phrases involved appear within sentence boundaries - I assume that a sentence with 100 words would be pretty unlikely, but still within the 1000-word separation done by increasing the position. No linguistics necessary, actually, but it is an application of a linguistic rule!
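[Editor's note: in Lucene terms Karsten's idea would be a custom analyzer or token filter that emits a large position increment at sentence separators. Below is a dependency-free sketch of just the position arithmetic; the class name, the naive boundary test, and the gap constant are illustrative assumptions, not anyone's actual implementation.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SentencePositions {
    static final int SENTENCE_GAP = 1000; // large jump at each sentence boundary

    // Map each token to the position it would receive in the index, bumping
    // the position counter by SENTENCE_GAP after each sentence separator.
    // (Duplicate tokens overwrite earlier ones; fine for a sketch.)
    static Map<String, Integer> positions(String text) {
        Map<String, Integer> out = new LinkedHashMap<>();
        int pos = 0;
        for (String raw : text.split("\\s+")) {
            boolean endsSentence = raw.matches(".*[.!?]$"); // naive boundary test
            String token = raw.replaceAll("[^A-Za-z0-9]", "").toLowerCase();
            if (!token.isEmpty()) out.put(token, pos++);
            if (endsSentence) pos += SENTENCE_GAP; // next sentence starts far away
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> p = positions("Income tax rose. Gain was small.");
        // "rose" and "gain" are adjacent words, but in different sentences,
        // so their positions end up more than SENTENCE_GAP apart.
        System.out.println(p);
    }
}
```

A sloppy phrase query with slop well under SENTENCE_GAP (like the ~100 in Karsten's example) then can only match within a sentence, which is the whole trick.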
RE: inter-term correlation [was Re: Vector Space Model in Lucene?]
respecting sentence boundaries and using them to affect a document's score in the ranking algorithm requires linguistic knowledge, not NLP knowledge. think about it.

Herb

-Original Message-
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 9:13 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

What you can do is use a POS tagger (i.e. one based on a maximum entropy model, or a Brill tagger if you just have English) and use a data mining algorithm for weighting your terms. Maybe you can use a hidden Markov model for that. You can build this on top of Lucene; it shouldn't be that difficult. But maybe I understand you wrong...
RE: inter-term correlation [was Re: Vector Space Model in Lucene?]
you cannot layer sentence boundary detection on top of Lucene and post-process the hit list without effectively building a completely new search engine index. if i am going to go to this trouble, there is no point to using Lucene at all.

Herb

-Original Message-
From: Tatu Saloranta [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 8:30 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

Hmmh? You implied that there are some useful distance heuristics (words 5 words apart or more correlate much less), and others have pointed out Lucene has many useful components. Building a more complex system from small components is usually considered a Good Thing (tm), not an "ad hoc solution". In fact, I would guess most experienced people around here start with Lucene defaults, and build their own systems by gradually customizing more and more of the pieces.

It may be there are actual fundamental problems with Lucene, regarding the approach you'd prefer, but I don't think it makes sense to brush off suggestions regarding distance & fuzzy/sloppy queries by claiming they are "just hacks".
RE: inter-term correlation [was Re: Vector Space Model in Lucene?]
i am not implying rejection of a match across sentence boundaries; i am saying that it receives a lower score than a match within a sentence boundary.

Herb

-Original Message-
From: Tatu Saloranta [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 8:15 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

Isn't that quite a strict interpretation, however? There are many cases where linguistically separate sentences do have strong dependencies; in the web world, simple things like list items may be very closely related.

Put another way: it may not be trivially easy to detect sentence boundaries, nor is it certain that what (from a language viewpoint) is a boundary really is a hard boundary from a semantic perspective. And are there not varying levels of separation between sentences (sentences close to each other are often related, back references being common), not just one?
RE: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])
i am stuck with company policy with respect to open source project participation. this is why i am dropping some fairly detailed hints of what has to be done instead of doing it myself. this policy may change in the next year, but by then, i will have to be working with a solution and not just looking for one.

Herb

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 6:45 PM
To: Lucene Users List
Subject: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

Hello Herb,

I don't approve of several teasing, mean, etc. emails I saw from a few people. This is a serious and polite email. :)

It sounds like you know about NLP and see places where Lucene could be improved. Lucene is open source and free, and could benefit from knowledgeable people like you. Are you interested in contributing some computational linguistics smarts, either as an improvement to Lucene's core (if the improvements are such that they don't make Lucene harder to use and its code significantly more complex and harder to maintain), or as an add-on module, some kind of extension, or even just an application built on top of Lucene, all of which could and would live outside of Lucene's core?

Otis
RE: inter-term correlation [was Re: Vector Space Model in Lucene?]
i have a program written in Icon that does basic sentence splitting. with about 5 heuristics and one small lookup table, i can get well over 90% accuracy doing sentence boundary detection on email. for well-edited English text, like newswires, i can manage closer to 99%. this is all that is needed for significantly improving a search engine's performance when the query engine respects sentence boundaries. incidentally, the GATE Information Extraction framework cites some references that indicate that for named entity feature extraction, their system can exceed the ability of trained humans to detect and classify named entities if only one person does the detection. collaborating humans are still better, but no one has the time in practical applications.

you probably know, since you know about Markov chains, that within-sentence term correlation, and hence the language model, is different than across sentences. linguists have known this for a very long time. it isn't hard to put this capability into a search engine, but it absolutely breaks down unless there is sentence boundary information stored for use at query time.

Herb

-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 5:54 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

Well ... Sure, nothing can replace a human mind. But believe it or not, there are studies which show that even human experts can significantly differ in their opinions on what the key-phrases for a given text are. So the results are never clear-cut with humans either...

So, in this sense a heuristic tool for sentence splitting and key-phrase detection can go a long way. For example, the application I mentioned uses quite a few heuristic rules (+ Markov chains as heavier ammunition :-), and it comes up with the following phrases for your email discussion (the text quoted below):

(lang=EN): NLP, trainable rule-based tagging, natural language processing, apache, NLP expert

Now, this set of key-phrases does reflect the main noun-phrases in the text... which means I have a practical and tangible benefit from NLP. QED ;-)

Best regards,
Andrzej
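[Editor's note: Herb describes roughly five heuristics plus a small lookup table; a sketch of that style of splitter is below. The specific rules and abbreviation list are my own illustrative guesses, not his actual Icon program.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HeuristicSplitter {
    // Small lookup table of abbreviations that do not end a sentence.
    static final Set<String> ABBREV = new HashSet<>(Arrays.asList(
            "dr", "mr", "mrs", "ms", "prof", "etc", "e.g", "i.e", "vs"));

    static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        String[] words = text.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (cur.length() > 0) cur.append(' ');
            cur.append(words[i]);
            if (endsSentence(words[i], i + 1 < words.length ? words[i + 1] : null)) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    // Heuristics: a sentence ends on '.', '!' or '?' unless the word is a
    // known abbreviation, or the next word starts lowercase (continuation).
    static boolean endsSentence(String word, String next) {
        char last = word.charAt(word.length() - 1);
        if (last == '!' || last == '?') return true;
        if (last != '.') return false;
        String bare = word.substring(0, word.length() - 1).toLowerCase();
        if (ABBREV.contains(bare)) return false;
        if (next != null && Character.isLowerCase(next.charAt(0))) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(split("Dr. Smith arrived late. Nobody minded. Really?"));
    }
}
```

Even a crude splitter like this, paired with sentence-aware position increments at index time, would give the query engine the boundary information Herb says it needs.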