Re: Indexing Speed: Documents vs. Sentences
> Hi, I am using Lucene to index a large number of web pages (a few 100GB) and the indexing speed is great.

Jochen .. a few 100 GB? Is this correct?

/victor

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Indexing Speed: Documents vs. Sentences
Hi,

Yes, this is correct, I am dealing with a few 100GB (close to 1TB). I am, however, distributing the data across several machines and then merging the results from all the machines together (until I find a better, faster solution).

Cheers!

-Original Message-
From: Victor Hadianto [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 10:50 PM
To: Lucene Users List
Subject: Re: Indexing Speed: Documents vs. Sentences

Jochen .. a few 100 GB? Is this correct?

/victor
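The distribute-then-merge setup described in this message can be sketched roughly as follows. This is only an illustration in plain Java, not Jochen's actual code and not a Lucene API: the `ShardMerge` and `Hit` names and their fields are hypothetical. It assumes each machine returns its own ranked hits and that the scores can be compared directly, which is not strictly true when every machine builds its index from different term statistics.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: merge per-machine result lists into one global top-k list. */
final class ShardMerge {

    /** One search hit; the field names are illustrative, not Lucene's. */
    static final class Hit {
        final String shard;
        final String docId;
        final float score;

        Hit(String shard, String docId, float score) {
            this.shard = shard;
            this.docId = docId;
            this.score = score;
        }
    }

    /** Combines the hits from every shard and keeps the k highest-scoring ones. */
    static List<Hit> mergeTopK(List<List<Hit>> perShard, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShard) {
            all.addAll(hits);
        }
        // Highest score first; ties are broken by docId so the order is deterministic.
        all.sort(Comparator.comparingDouble((Hit h) -> -h.score)
                .thenComparing(h -> h.docId));
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }
}
```

Calling `ShardMerge.mergeTopK(listsFromAllMachines, 10)` would return the ten best hits overall. Sorting the concatenated lists keeps the sketch short; a priority queue would avoid sorting hits that are never returned.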
Re: Indexing Speed: Documents vs. Sentences
Interesting. What hardware are you using for searching 1TB of data, and how fast is the response time?

On Thu, Dec 18, 2003 at 08:23:42AM -0800, Jochen Frey wrote:
> Yes, this is correct, I am dealing with a few 100GB (close to 1TB). I am, however, distributing the data across several machines and then merge the results from all the machines together (until I find a better faster solution).

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com
Re: Indexing Speed: Documents vs. Sentences
> Interesting. What hardware are you using for searching 1TB of data, and how fast is the response time?

Me too :) I'm interested in how many documents you indexed. Do you have lots and lots of documents, or do you have large documents?

/victor
RE: Indexing Speed: Documents vs. Sentences
I'm confused about something - what's the point of creating a document for every sentence?

-Original Message-
From: Jochen Frey [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 4:17 PM
To: 'Lucene Users List'
Subject: Indexing Speed: Documents vs. Sentences

Hi,

I am using Lucene to index a large number of web pages (a few 100GB) and the indexing speed is great.

Lately I have been trying to index on a sentence level, not the document level. My problem is that the indexing speed has gone down dramatically and I am wondering if there is any way for me to improve on that.

When indexing on a sentence level, the overall amount of data stays the same while the number of records increases substantially (since there are usually many sentences to one web page). It seems to me like the indexing speed (everything else being the same) depends largely on the number of Documents inserted into the index, and not so much on the size of the data within the documents (correct?).

I have played with the merge factor, using RAMDirectory, etc., and I am quite comfortable with our overall configuration, so my guess is that that is not the issue (and I am QUITE happy with the indexing speed as long as I use complete pages and not sentences).

Maybe there is a different way of attacking this? My goal is to be able to execute a query and get the sentences that match the query in the most efficient way while maintaining good/great indexing speed. I would prefer not having to search the complete document for the sentence in question.

My current solution is to have one Lucene Document for each page (containing the URL and other information I require) that does NOT contain the text of the page. Then I have one Lucene Document for each sentence within that document, which contains the text of this particular sentence in addition to some identifying information that references the entry of the page itself.

Any and all suggestions are welcome.

Thanks!

Jochen
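To make the layout in the original post concrete, here is a rough sketch in plain Java of how one page turns into many sentence-level records that each point back to the page. The `SentenceRecords` and `SentenceRecord` names, the `pageId` field, and the regex-based splitter are all assumptions made for illustration; the thread does not say how the sentences are detected, and no Lucene calls are shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the page/sentence record layout described above. */
final class SentenceRecords {

    /** One record per sentence, pointing back to its page by ID. */
    static final class SentenceRecord {
        final String pageId;
        final int sentenceNo;
        final String text;

        SentenceRecord(String pageId, int sentenceNo, String text) {
            this.pageId = pageId;
            this.sentenceNo = sentenceNo;
            this.text = text;
        }
    }

    /** Naive splitter (an assumption): breaks after '.', '!' or '?' followed by whitespace. */
    static List<SentenceRecord> split(String pageId, String pageText) {
        List<SentenceRecord> out = new ArrayList<>();
        int n = 0;
        for (String s : pageText.trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                out.add(new SentenceRecord(pageId, n++, s));
            }
        }
        return out;
    }
}
```

Each `SentenceRecord` would correspond to one sentence-level Lucene Document, so a page with fifty sentences produces fifty index entries instead of one. That growth in the number of records is what the post suspects is slowing indexing down.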
RE: Indexing Speed: Documents vs. Sentences
Hi!

In essence:

1) I don't care about the whole page.
2) I only care about the actual sentence that matches the query.
3) I want the matching for the query to happen only within one sentence and not across sentence boundaries (even when I do a PhraseQuery with some slop). The query:

   "i like the beach"~20

   should not match:

   And we go to the restaurant and i really like it. the beach was wonderful as well.
4) I would much prefer not to parse the actual page to find the sentence that matches the query (though I obviously will, if I have to).

Does that answer your question?

Thanks!

Jochen

-Original Message-
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 1:19 PM
To: 'Lucene Users List'
Subject: RE: Indexing Speed: Documents vs. Sentences

I'm confused about something - what's the point of creating a document for every sentence?
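Requirement 3 can be illustrated with a small plain-Java sketch. Nobody in the thread proposes this; it is one possible approach, offered as an assumption: keep the whole page as one record, but leave a large jump in token positions at every sentence boundary, so that a proximity window smaller than the jump can never span two sentences. The `SentenceGap` class, the `gap` parameter, and the simplified `near` test are hypothetical, and `near` is not Lucene's exact slop calculation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: sentence boundaries become large jumps in token position. */
final class SentenceGap {

    /** Maps each lower-cased term to its positions, skipping `gap` positions between sentences. */
    static Map<String, List<Integer>> positions(List<String> sentences, int gap) {
        Map<String, List<Integer>> index = new HashMap<>();
        int pos = 0;
        for (String sentence : sentences) {
            for (String term : sentence.toLowerCase().split("[^a-z0-9]+")) {
                if (term.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(term, t -> new ArrayList<>()).add(pos++);
            }
            pos += gap; // the jump that keeps sentences apart
        }
        return index;
    }

    /** Simplified proximity test: true if all terms occur within `window` positions of some start. */
    static boolean near(Map<String, List<Integer>> index, List<String> terms, int window) {
        List<Integer> starts = new ArrayList<>();
        for (String t : terms) {
            List<Integer> ps = index.get(t);
            if (ps == null) {
                return false; // a query term does not occur at all
            }
            starts.addAll(ps);
        }
        for (int start : starts) {
            boolean all = true;
            for (String t : terms) {
                boolean found = false;
                for (int p : index.get(t)) {
                    if (p >= start && p <= start + window) {
                        found = true;
                        break;
                    }
                }
                if (!found) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }
}
```

With the two example sentences from the message and a window of 20, the terms i, like, the, beach are found together when `gap` is 0 but not when `gap` is 1000. The gap only has to be larger than the biggest window that will ever be queried.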
RE: Indexing Speed: Documents vs. Sentences
When you parse the page you can prevent sentence-boundary hits from matching your criteria.

-Original Message-
From: Jochen Frey [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 4:34 PM
To: 'Lucene Users List'
Subject: RE: Indexing Speed: Documents vs. Sentences

Right. However, even if I do that, my problem #3 below remains unsolved: I do not wish to match phrases across sentence boundaries.

Anyone have a neat solution (or pointers to one)?

Thanks again!

Jochen

-Original Message-
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 1:29 PM
To: 'Lucene Users List'
Subject: RE: Indexing Speed: Documents vs. Sentences

Yeah. I'd suggest parsing the page, unfortunately. :)

-Original Message-
From: Jochen Frey [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 4:26 PM
To: 'Lucene Users List'
Subject: RE: Indexing Speed: Documents vs. Sentences

Hi! In essence: 1) I don't care about the whole page 2) I only care about the actual sentence that matches the query. 3) I want the matching for the query only to happen within one sentence and not over sentence boundaries (even when I do a PhraseQuery with some slop). The query: i like the beach~20 should not match: And we go to the restaurant and i really like it. the beach was wonderful as well. 4) I would much prefer not to parse the actual page to find the sentence that matches the query (though I obviously will, if I have to). Does that answer your question? Thanks! Jochen
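Dan's suggestion of parsing the page after the search might look something like the sketch below, in plain Java. It is a guess at the idea, not code from the thread: the `SentenceFilter` name, the naive regex splitter, and the rule "a sentence matches if it contains every query term" are all assumptions, and the slop inside a sentence is ignored for brevity. Query terms are expected in lower case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: post-filter a page-level hit down to the sentences that match. */
final class SentenceFilter {

    /** Returns the sentences of pageText that contain every (lower-cased) query term. */
    static List<String> matchingSentences(String pageText, List<String> queryTerms) {
        List<String> out = new ArrayList<>();
        // Naive sentence split (an assumption): break after '.', '!' or '?' plus whitespace.
        for (String sentence : pageText.trim().split("(?<=[.!?])\\s+")) {
            Set<String> words = new HashSet<>(
                    Arrays.asList(sentence.toLowerCase().split("[^a-z0-9]+")));
            if (words.containsAll(queryTerms)) {
                out.add(sentence);
            }
        }
        return out;
    }
}
```

A page-level hit whose terms are spread over two sentences yields an empty list here, so it can be dropped from the results. The cost is that every candidate page has to be re-read and split at query time, which is exactly what point 4 of Jochen's message hopes to avoid.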
RE: Indexing Speed: Documents vs. Sentences
Dan,

I will send you a separate e-mail directly to your address.

In the meantime, I hope to get input from other people. Maybe someone else knows how to solve my original problem.

Thanks!

Jochen

-Original Message-
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 1:36 PM
To: 'Lucene Users List'
Subject: RE: Indexing Speed: Documents vs. Sentences

When you parse the page you can prevent sentence-boundary hits from matching your criteria.