Re: Summarization; sentence-level and document-level filters.
Gregor Heinrich wrote: Yes, copying a summary from one field to an untokenized field was the plan. I identified DocumentWriter.invertDocument() as a possible place to add this document-level analysis, but I admit that appears far too low-level and inflexible for the overall design, so I'll make it two-pass indexing. Here is how I did it: I'm indexing HTML documents, so before Lucene can do anything I need to run an HTML parser. While scanning the tags, this parser builds two text strings at the same time: one containing the document content for indexing, and one containing it for summarizing. There are relevant differences between the two strings, for example in the handling of headlines and punctuation. Lucene then indexes the first string, and the second is handed to a summarizer. The summary it returns is added as a Lucene field. This way I can do summarizing and indexing in one pass. Ulrich - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
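A minimal sketch of this two-string idea (plain Java; `TwoStringParser` is a made-up name, the regex-based tag handling is illustrative only, and a real HTML parser would drive the real version - the point is only the single pass producing both views):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoStringParser {
    // Builds two views of an HTML document in one pass: indexText keeps
    // the raw content for Lucene, while summaryText turns headlines into
    // sentence-like units (trailing full stop) so a summarizer sees
    // proper sentence boundaries.
    public static String[] parse(String html) {
        StringBuilder indexText = new StringBuilder();
        StringBuilder summaryText = new StringBuilder();
        Matcher m = Pattern.compile("<h\\d>(.*?)</h\\d>|<p>(.*?)</p>",
                                    Pattern.DOTALL).matcher(html);
        while (m.find()) {
            if (m.group(1) != null) {                         // headline
                indexText.append(m.group(1)).append(' ');
                summaryText.append(m.group(1)).append(". ");  // add full stop
            } else {                                          // paragraph
                indexText.append(m.group(2)).append(' ');
                summaryText.append(m.group(2)).append(' ');
            }
        }
        return new String[] { indexText.toString().trim(),
                              summaryText.toString().trim() };
    }
}
```

In the real setup, the first string would go to the Analyzer for indexing and the second to the summarizer, whose result is added as an extra field.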
Re: Summarization; sentence-level and document-level filters.
Gregor, I don't have any benchmarks for summarization, sorry! I have two test versions of commercial summarizers and their performance is better than Classifier4J's, but those are written in C++, so you can't compare them properly. regards, Maurits - Original Message - From: Gregor Heinrich [EMAIL PROTECTED] To: 'Lucene Users List' [EMAIL PROTECTED] Sent: Tuesday, December 16, 2003 9:35 PM Subject: RE: Summarization; sentence-level and document-level filters. Maurits: thanks for the pointer to Classifier4J -- I have had a look at the package and tried the SimpleSummarizer, and it seems to work fine. (However, as I don't know the benchmarks for summarization, I'm not the one to judge.) Do you have experience with it? Gregor -Original Message- From: maurits van wijland [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 16, 2003 1:09 AM To: Lucene Users List; [EMAIL PROTECTED] Subject: Re: Summarization; sentence-level and document-level filters. Hi Gregor, As far as I know, there is no summarizer in the plans, but maybe I can help you along the way. Have a look at the Classifier4J project on SourceForge: http://classifier4j.sourceforge.net/ It has a small document summarizer besides a Bayes classifier. It might speed up your coding. On the level of Lucene, I have no idea. My gut feeling says that a summary should be built before the text is tokenized! The tokenizer can of course be used when analysing a document, but hooking into the Lucene indexing is a bad idea, I think. Does anyone else have ideas? regards, Maurits - Original Message - From: Gregor Heinrich [EMAIL PROTECTED] To: 'Lucene Users List' [EMAIL PROTECTED] Sent: Monday, December 15, 2003 7:41 PM Subject: Summarization; sentence-level and document-level filters. Hi, is there any possibility to do sentence-level or document-level analysis with the current Analysis/TokenStream architecture? Or where else is the best place to plug in customised document-level and sentence-level analysis features?
Is there any precedent for this? My technical problem: I'd like to include a summarization feature in my system, which should (1) make the best use of the architecture already present in Lucene, and (2) be able to trigger summarization on a per-document basis while requiring sentence-level information, such as full stops and commas. To preserve this punctuation, a special Tokenizer can be used that outputs such landmarks as tokens instead of filtering them out. The actual SummaryFilter then filters out the punctuation for its successors in the Analyzer's filter chain. The other, more complex issue is the document-level information: as Lucene's architecture uses a filter concept that knows nothing about the document the tokens are generated from (which is good abstraction), a document-specific operation like summarization is a bit awkward to fit in (and originally not intended, I guess). On the other hand, I'd like to keep the existing filter structure in place for preprocessing of the input, because my raw texts are generated by converters from other formats that output unwanted characters (from figures, page numbers, etc.), which my custom Analyzer filters out anyway. Any idea how to solve this second problem? Is any support for such document/sentence structure analysis planned? Thanks and regards, Gregor
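The punctuation-as-tokens idea can be sketched without the Lucene classes (plain Java; `PunctTokenizer` and `stripPunct` are made-up names, not Lucene API): the tokenizer emits full stops and commas as tokens of their own, and a later stage - playing the role of the SummaryFilter's output toward the rest of the chain - drops them again.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctTokenizer {
    // Emits words, plus '.' and ',' as standalone "landmark" tokens,
    // so a downstream summarizer can see sentence boundaries.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
            } else {
                if (word.length() > 0) {
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                if (c == '.' || c == ',') tokens.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) tokens.add(word.toString());
        return tokens;
    }

    // What the filter chain sees after the summarizer is done:
    // the punctuation tokens are removed before indexing proper.
    public static List<String> stripPunct(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens)
            if (!t.equals(".") && !t.equals(",")) out.add(t);
        return out;
    }
}
```

In Lucene terms, `tokenize` corresponds to the special Tokenizer and `stripPunct` to what the SummaryFilter passes on to its successors.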
Re: Lucene and Mysql
Hi, You should create a Lucene Document for each record in your table. Make each of the columns that contain text a field on the Document object, and also store the record's primary key as a field. Here's a very basic article I wrote about using Lucene: http://builder.com.com/5100-6389-5054799.html Jeff - Original Message - From: Stefan Trcko [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Tuesday, December 16, 2003 2:30 PM Subject: Lucene and Mysql Hello, I'm new to Lucene. I want users to be able to search text that is stored in a MySQL database. Is there any tutorial on how to implement this kind of search feature? Best regards, Stefan
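The record-to-Document mapping Jeff describes can be sketched as follows (plain Java, with a `Map` standing in for Lucene's `Document` so the sketch runs on its own; in real code you would call `doc.add(...)` with a keyword field for the primary key and text fields for the searchable columns):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecordMapper {
    // One "document" per table row: every text column becomes a field,
    // and the primary key is stored so a search hit can be traced back
    // to the original record in the database.
    public static Map<String, String> toDocument(Map<String, String> row,
                                                 String pkColumn,
                                                 List<String> textColumns) {
        Map<String, String> doc = new HashMap<>();
        doc.put("pk", row.get(pkColumn));   // stored, untokenized key field
        for (String col : textColumns)
            doc.put(col, row.get(col));     // tokenized text fields
        return doc;
    }
}
```

At search time you read the "pk" field from each hit and fetch the full record from MySQL with it.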
Displaying Query
Hi all, I use this code: Query query = QueryParser.parse(q, "Contenu", new Analyseur()); String larequet = query.toString(); System.out.println("la requête à traiter est: " + larequet); And the line displayed is [EMAIL PROTECTED] I don't know why my query string isn't displayed correctly. Can someone help me? Best regards, Gayo
RE: Displaying Query
Try: String larequet = query.toString("default field name here"); For example: String larequet = query.toString("texte"); That should give you the string version of the query. -Original Message- From: Gayo Diallo [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 17, 2003 10:46 AM To: [EMAIL PROTECTED] Subject: Displaying Query Hi all, I use this code: Query query = QueryParser.parse(q, "Contenu", new Analyseur()); String larequet = query.toString(); System.out.println("la requête à traiter est: " + larequet); And the line displayed is [EMAIL PROTECTED] I don't know why my query string isn't displayed correctly. Can someone help me? Best regards, Gayo
Indexing Speed: Documents vs. Sentences
Hi, I am using Lucene to index a large number of web pages (a few 100GB) and the indexing speed is great. Lately I have been trying to index on the sentence level, not the document level. My problem is that the indexing speed has gone down dramatically, and I am wondering if there is any way for me to improve on that. Indexing on the sentence level, the overall amount of data stays the same while the number of records increases substantially (since there are usually many sentences to one web page). It seems to me that the indexing speed (everything else being equal) depends largely on the number of Documents inserted into the index, and not so much on the size of the data within the documents (correct?). I have played with the merge factor, used RAMDirectory, etc., and I am quite comfortable with our overall configuration, so my guess is that that is not the issue (and I am QUITE happy with the indexing speed as long as I use complete pages and not sentences). Maybe there is a different way of attacking this? My goal is to be able to execute a query and get the sentences that match it in the most efficient way while maintaining good/great indexing speed. I would prefer not having to search the complete document for the sentence in question. My current solution is to have one Lucene Document for each page (containing the URL and other information I require) that does NOT contain the text of the page. Then I have one Lucene Document for each sentence within that page, which contains the text of that particular sentence in addition to some identifying information that references the entry of the page itself. Any and all suggestions are welcome. Thanks! Jochen
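Jochen's page-plus-sentences layout can be sketched like this (plain Java; `toSentenceDocs` and the naive regex splitter are illustrative, and in real code each returned triple would become a Lucene Document with the page id stored as a keyword field):

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceSplitter {
    // One record per sentence, each carrying a back-reference to its
    // page and its position in it, so a hit returns the sentence
    // directly without re-parsing the whole page.
    // Each entry is { pageId, sentenceNumber, sentenceText }.
    public static List<String[]> toSentenceDocs(String pageId, String text) {
        List<String[]> docs = new ArrayList<>();
        int n = 0;
        for (String s : text.split("(?<=[.!?])\\s+")) {  // naive splitter
            s = s.trim();
            if (!s.isEmpty())
                docs.add(new String[] { pageId, String.valueOf(n++), s });
        }
        return docs;
    }
}
```

The trade-off Jochen observes follows from this shape: the text volume is unchanged, but the number of Documents (and hence per-Document indexing overhead) grows by the sentences-per-page factor.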
RE: Indexing Speed: Documents vs. Sentences
I'm confused about something - what's the point of creating a document for every sentence? -Original Message- From: Jochen Frey [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 17, 2003 4:17 PM To: 'Lucene Users List' Subject: Indexing Speed: Documents vs. Sentences
RE: Indexing Speed: Documents vs. Sentences
Hi! In essence: 1) I don't care about the whole page. 2) I only care about the actual sentence that matches the query. 3) I want the matching to happen only within one sentence, never across sentence boundaries (even when I do a PhraseQuery with some slop). The query "i like the beach"~20 should not match: "And we go to the restaurant and i really like it. the beach was wonderful as well." 4) I would much prefer not to parse the actual page to find the sentence that matches the query (though I obviously will if I have to). Does that answer your question? Thanks! Jochen -Original Message- From: Dan Quaroni [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 17, 2003 1:19 PM To: 'Lucene Users List' Subject: RE: Indexing Speed: Documents vs. Sentences I'm confused about something - what's the point of creating a document for every sentence?
RE: Indexing Speed: Documents vs. Sentences
When you parse the page you can prevent sentence-boundary hits from matching your criteria. -Original Message- From: Jochen Frey [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 17, 2003 4:34 PM To: 'Lucene Users List' Subject: RE: Indexing Speed: Documents vs. Sentences Right. However, even if I do that, my problem #3 below remains unsolved: I do not wish to match phrases across sentence boundaries. Anyone have a neat solution (or pointers to one)? Thanks again! Jochen -Original Message- From: Dan Quaroni [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 17, 2003 1:29 PM To: 'Lucene Users List' Subject: RE: Indexing Speed: Documents vs. Sentences Yeah. I'd suggest parsing the page, unfortunately. :)
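One approach to the problem of phrases matching across sentence boundaries, not mentioned in the thread but a common trick, is to leave a large position gap between sentences at indexing time, so a sloppy PhraseQuery can never span a boundary (in Lucene this corresponds to bumping the token position increment at each boundary, where the API supports it). The sketch below models token positions and a simplified two-term sloppy match in plain Java; all names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PositionGap {
    static final int GAP = 100;  // gap inserted at each sentence boundary

    // Assigns a position to every token; consecutive sentences are
    // separated by GAP, so tokens from different sentences are always
    // further apart than any reasonable slop.
    public static Map<String, List<Integer>> positions(List<List<String>> sentences) {
        Map<String, List<Integer>> pos = new HashMap<>();
        int p = 0;
        for (List<String> sent : sentences) {
            p += GAP;                           // boundary gap
            for (String tok : sent) {
                pos.computeIfAbsent(tok, k -> new ArrayList<>()).add(p);
                p++;
            }
        }
        return pos;
    }

    // Simplified model of a two-term phrase with slop: it matches only
    // if some pair of occurrences lies within slop positions.
    public static boolean sloppyMatch(Map<String, List<Integer>> pos,
                                      String a, String b, int slop) {
        for (int pa : pos.getOrDefault(a, List.of()))
            for (int pb : pos.getOrDefault(b, List.of()))
                if (Math.abs(pb - pa) <= slop) return true;
        return false;
    }
}
```

With a gap of 100 and a slop of 20, "like ... beach" from Jochen's example cannot match across the sentence break, while terms inside one sentence still can.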
RE: Indexing Speed: Documents vs. Sentences
Dan, I will send you a separate e-mail directly to your address. In the meantime, I hope to get input from other people; maybe someone else knows how to solve my original problem. Thanks! Jochen -Original Message- From: Dan Quaroni [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 17, 2003 1:36 PM To: 'Lucene Users List' Subject: RE: Indexing Speed: Documents vs. Sentences When you parse the page you can prevent sentence-boundary hits from matching your criteria.