Dan, I will send you a separate e-mail directly to your address. In the meanwhile, I hope to get input from other people. Maybe someone else knows how to solve my original problem below.
Thanks! Jochen > -----Original Message----- > From: Dan Quaroni [mailto:[EMAIL PROTECTED] > Sent: Wednesday, December 17, 2003 1:36 PM > To: 'Lucene Users List' > Subject: RE: Indexing Speed: Documents vs. Sentences > > When you parse the page you can prevent sentence-boundry hits from > matching > your criteria > > -----Original Message----- > From: Jochen Frey [mailto:[EMAIL PROTECTED] > Sent: Wednesday, December 17, 2003 4:34 PM > To: 'Lucene Users List' > Subject: RE: Indexing Speed: Documents vs. Sentences > > > Right. > > However, even if I do that, my problem #3 below remains unsolved: I do not > wish to match phrases across sentence boundaries. > > Anyone have a neat solution (or pointers to one)? > > Thanks again! > Jochen > > > -----Original Message----- > > From: Dan Quaroni [mailto:[EMAIL PROTECTED] > > Sent: Wednesday, December 17, 2003 1:29 PM > > To: 'Lucene Users List' > > Subject: RE: Indexing Speed: Documents vs. Sentences > > > > Yeah. I'd suggest parsing the page, unfortunately. :) > > > > -----Original Message----- > > From: Jochen Frey [mailto:[EMAIL PROTECTED] > > Sent: Wednesday, December 17, 2003 4:26 PM > > To: 'Lucene Users List' > > Subject: RE: Indexing Speed: Documents vs. Sentences > > > > > > Hi! > > > > In essence: > > 1) I don't care about the whole page > > > > 2) I only care about the actual sentence that matches the query. > > > > 3) I want the matching for the query only to happen within one sentence > > and > > not over sentence boundaries (even when I do a PhraseQuery with some > > slop). > > > > The query: "i like the beach"~20 > > should not match: "And we go to the restaurant and i really like it. the > > beach was wonderful as well". > > > > 4) I would much prefer not to parse the actual page to find the sentence > > that matches the query (though I obviously will, if I have to). > > > > Does that answer your question? > > > > Thanks! > > Jochen > > > > > -----Original Message----- > > > From: Dan Quaroni [mailto:[EMAIL PROTECTED] > > > Sent: Wednesday, December 17, 2003 1:19 PM > > > To: 'Lucene Users List' > > > Subject: RE: Indexing Speed: Documents vs. Sentences > > > > > > I'm confused about something - what's the point of creating a document > > for > > > every sentence? > > > > > > -----Original Message----- > > > From: Jochen Frey [mailto:[EMAIL PROTECTED] > > > Sent: Wednesday, December 17, 2003 4:17 PM > > > To: 'Lucene Users List' > > > Subject: Indexing Speed: Documents vs. Sentences > > > > > > > > > Hi, > > > > > > I am using Lucene to index a large number of web pages (a few 100GB) > and > > > the > > > indexing speed is great. > > > > > > Lately I have been trying to index on a sentence level, not the > document > > > level. My problem is that the indexing speed has gone down > dramatically > > > and > > > I am wondering if there is any way for me to improve on that. > > > > > > Indexing on a sentence level the overall amount of data stays the same > > > while > > > the number of records increases substantially (since there is usually > > many > > > sentences to one web page). > > > > > > It seems to me like the indexing speed (everything else being the > same) > > > depends largely on the number of Documents inserted into the index, > and > > > not > > > so much on the size of the data within the documents (correct?). > > > > > > I have played with the merge factor, using RAMDirectory, etc and I am > > > quite > > > comfortable with our overall configuration, so my guess is that that > is > > > not > > > the issue (and I am QUITE happy with the indexing speed as long as I > use > > > complete pages and not sentences). > > > > > > Maybe there is a different way of attacking this? My goal is to be > able > > to > > > execute a query and get the sentences that match the query in the most > > > efficient way while maintaining good/great indexing speed. I would > > prefer > > > not having to search the complete document for the sentence in > question. > > > > > > My current solution is to have one Lucene Document for each page > > > (containing > > > the URL and other information I require) that does NOT contain the > text > > of > > > the page. Then I have one Lucene Document for each sentence within > that > > > document, which contains the text of this particular sentence in > > addition > > > to > > > some identifying information that references the entry of the page > > itself. > > > > > > Any and all suggestions are welcome. > > > > > > Thanks! > > > Jochen > > > > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]