Hi!

In essence:

1) I don't care about the whole page.
2) I only care about the actual sentence that matches the query.
3) I want the query to be matched within a single sentence only, never
   across sentence boundaries (even when I do a PhraseQuery with some slop).
   The query:
     "i like the beach"~20
   should not match:
     "And we go to the restaurant and i really like it. the beach was
     wonderful as well".
4) I would much prefer not to parse the actual page to find the sentence
   that matches the query (though I obviously will, if I have to).

Does that answer your question?

Thanks!
Jochen

> -----Original Message-----
> From: Dan Quaroni [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, December 17, 2003 1:19 PM
> To: 'Lucene Users List'
> Subject: RE: Indexing Speed: Documents vs. Sentences
>
> I'm confused about something - what's the point of creating a document for
> every sentence?
>
> -----Original Message-----
> From: Jochen Frey [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, December 17, 2003 4:17 PM
> To: 'Lucene Users List'
> Subject: Indexing Speed: Documents vs. Sentences
>
> Hi,
>
> I am using Lucene to index a large number of web pages (a few 100GB) and
> the indexing speed is great.
>
> Lately I have been trying to index on a sentence level, not the document
> level. My problem is that the indexing speed has gone down dramatically
> and I am wondering if there is any way for me to improve on that.
>
> Indexing on a sentence level, the overall amount of data stays the same
> while the number of records increases substantially (since there are
> usually many sentences to one web page).
>
> It seems to me like the indexing speed (everything else being the same)
> depends largely on the number of Documents inserted into the index, and
> not so much on the size of the data within the documents (correct?).
>
> I have played with the merge factor, using RAMDirectory, etc and I am
> quite comfortable with our overall configuration, so my guess is that that
> is not the issue (and I am QUITE happy with the indexing speed as long as I
> use complete pages and not sentences).
>
> Maybe there is a different way of attacking this? My goal is to be able to
> execute a query and get the sentences that match the query in the most
> efficient way while maintaining good/great indexing speed. I would prefer
> not having to search the complete document for the sentence in question.
>
> My current solution is to have one Lucene Document for each page
> (containing the URL and other information I require) that does NOT contain
> the text of the page. Then I have one Lucene Document for each sentence
> within that document, which contains the text of this particular sentence
> in addition to some identifying information that references the entry of
> the page itself.
>
> Any and all suggestions are welcome.
>
> Thanks!
> Jochen
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
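
To make points 2) and 3) concrete, here is a minimal sketch in plain Python
(it is not Lucene code and uses no Lucene API). It mimics the layout described
in the quoted message: one record per page without the text, plus one record
per sentence that points back to the page. Everything in it is an assumption
made for illustration: the names split_sentences, tokenize, sloppy_match,
index_page and search_sentences, the example URL, the naive sentence splitter,
and the simplified slop rule (terms must appear in order, with at most `slop`
extra positions between them in total). Lucene's real slop semantics are more
general, so treat this only as a picture of the desired behaviour.

```python
import re

# Example text taken from the message above.
TEXT = ("And we go to the restaurant and i really like it. "
        "the beach was wonderful as well")


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def sloppy_match(tokens, phrase, slop):
    """Simplified slop rule (assumption, not Lucene's exact definition):
    the phrase terms must occur in order, with at most `slop` extra
    token positions between the first and last term in total."""
    if not phrase:
        return False
    for start, tok in enumerate(tokens):
        if tok != phrase[0]:
            continue
        pos = start
        for term in phrase[1:]:
            try:
                # Earliest occurrence of the next term after the current one.
                pos = tokens.index(term, pos + 1)
            except ValueError:
                break
        else:
            extra = (pos - start) - (len(phrase) - 1)
            if extra <= slop:
                return True
    return False


def index_page(url, text):
    """Build one page record (no text) and one record per sentence."""
    page = {"url": url}
    sentences = [
        {"page_url": url, "n": i, "text": s}
        for i, s in enumerate(split_sentences(text))
    ]
    return page, sentences


def search_sentences(sentence_records, query, slop):
    """Return the sentence records whose own text matches the sloppy phrase."""
    phrase = tokenize(query)
    return [
        rec for rec in sentence_records
        if sloppy_match(tokenize(rec["text"]), phrase, slop)
    ]
```

Matched against the whole page text, "i like the beach" with slop 20 is found,
because the terms straddle the sentence boundary. Matched against the
per-sentence records, the same query returns nothing, which is the behaviour
asked for in point 3). Each hit also carries page_url, so the matching sentence
comes back without re-parsing the page (point 4).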