RE: Indexing Speed: Documents vs. Sentences

Jochen Frey Wed, 17 Dec 2003 15:07:29 -0800

Dan, I will send you a separate e-mail directly to your address.

In the meantime, I hope to get input from other people. Maybe someone else
knows how to solve my original problem below.


Thanks!
Jochen

> -----Original Message-----
> From: Dan Quaroni [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, December 17, 2003 1:36 PM
> To: 'Lucene Users List'
> Subject: RE: Indexing Speed: Documents vs. Sentences
> 
> When you parse the page you can prevent sentence-boundary hits from
> matching your criteria.
> 
> -----Original Message-----
> From: Jochen Frey [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, December 17, 2003 4:34 PM
> To: 'Lucene Users List'
> Subject: RE: Indexing Speed: Documents vs. Sentences
> 
> 
> Right.
> 
> However, even if I do that, my problem #3 below remains unsolved: I do not
> wish to match phrases across sentence boundaries.
> 
> Anyone have a neat solution (or pointers to one)?
> 
> Thanks again!
> Jochen
> 
> > -----Original Message-----
> > From: Dan Quaroni [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, December 17, 2003 1:29 PM
> > To: 'Lucene Users List'
> > Subject: RE: Indexing Speed: Documents vs. Sentences
> >
> > Yeah.  I'd suggest parsing the page, unfortunately. :)
> >
> > -----Original Message-----
> > From: Jochen Frey [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, December 17, 2003 4:26 PM
> > To: 'Lucene Users List'
> > Subject: RE: Indexing Speed: Documents vs. Sentences
> >
> >
> > Hi!
> >
> > In essence:
> > 1) I don't care about the whole page
> >
> > 2) I only care about the actual sentence that matches the query.
> >
> > > 3) I want the matching for the query only to happen within one sentence
> > > and not over sentence boundaries (even when I do a PhraseQuery with some
> > > slop).
> >
> > The query: "i like the beach"~20
> > should not match: "And we go to the restaurant and i really like it. the
> > beach was wonderful as well".
> >
> > 4) I would much prefer not to parse the actual page to find the sentence
> > that matches the query (though I obviously will, if I have to).
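
A sketch of one way to approach point 3 above: leave a large gap in token positions at every sentence boundary, so that a proximity match whose slop is smaller than the gap cannot span two sentences. The Java below illustrates the idea on its own and does not call Lucene. The names SentenceGapDemo, positions and matches are hypothetical, the sentence splitter is deliberately naive, and the in-order window test is a simplification, not Lucene's PhraseQuery matching. How this would be wired into an Analyzer is left open.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SentenceGapDemo {

    /**
     * Maps each lower-cased term to the ascending list of positions where it
     * occurs. After every sentence, the position counter jumps by `gap`.
     * Assumption: sentences end with '.', '!' or '?' followed by whitespace.
     */
    static Map<String, List<Integer>> positions(String text, int gap) {
        Map<String, List<Integer>> index = new HashMap<>();
        int pos = 0;
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            for (String token : sentence.toLowerCase().split("[^a-z0-9]+")) {
                if (token.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(token, k -> new ArrayList<>()).add(pos++);
            }
            pos += gap; // sentence boundary: push the next sentence far away
        }
        return index;
    }

    /**
     * Simplified in-order proximity test: the terms must occur in query order,
     * with at most `slop` extra positions between the first and last term.
     */
    static boolean matches(Map<String, List<Integer>> index, String[] terms, int slop) {
        for (String t : terms) {
            if (!index.containsKey(t)) {
                return false;
            }
        }
        for (int start : index.get(terms[0])) {
            int prev = start;
            boolean found = true;
            for (int i = 1; i < terms.length && found; i++) {
                int next = -1;
                for (int p : index.get(terms[i])) {
                    if (p > prev) { // earliest occurrence after the previous term
                        next = p;
                        break;
                    }
                }
                if (next < 0) {
                    found = false;
                } else {
                    prev = next;
                }
            }
            if (found && prev - start - (terms.length - 1) <= slop) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "And we go to the restaurant and i really like it. "
                + "the beach was wonderful as well";
        String[] query = {"i", "like", "the", "beach"};
        // Without a gap the terms are close enough to match across the boundary.
        System.out.println(matches(positions(text, 0), query, 20));
        // With a gap larger than the slop the cross-sentence match is rejected.
        System.out.println(matches(positions(text, 1000), query, 20));
    }
}
```

The gap only has to exceed the largest slop that will ever be used; 1000 here is an arbitrary illustrative value.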
> >
> > Does that answer your question?
> >
> > Thanks!
> > Jochen
> >
> > > -----Original Message-----
> > > From: Dan Quaroni [mailto:[EMAIL PROTECTED]
> > > Sent: Wednesday, December 17, 2003 1:19 PM
> > > To: 'Lucene Users List'
> > > Subject: RE: Indexing Speed: Documents vs. Sentences
> > >
> > > > I'm confused about something - what's the point of creating a document
> > > > for every sentence?
> > >
> > > -----Original Message-----
> > > From: Jochen Frey [mailto:[EMAIL PROTECTED]
> > > Sent: Wednesday, December 17, 2003 4:17 PM
> > > To: 'Lucene Users List'
> > > Subject: Indexing Speed: Documents vs. Sentences
> > >
> > >
> > > Hi,
> > >
> > > > I am using Lucene to index a large number of web pages (a few 100GB)
> > > > and the indexing speed is great.
> > >
> > > > Lately I have been trying to index on a sentence level, not the
> > > > document level. My problem is that the indexing speed has gone down
> > > > dramatically and I am wondering if there is any way for me to improve
> > > > on that.
> > >
> > > > Indexing on a sentence level, the overall amount of data stays the same
> > > > while the number of records increases substantially (since there are
> > > > usually many sentences to one web page).
> > >
> > > > It seems to me like the indexing speed (everything else being the
> > > > same) depends largely on the number of Documents inserted into the
> > > > index, and not so much on the size of the data within the documents
> > > > (correct?).
> > >
> > > > I have played with the merge factor, using RAMDirectory, etc., and I am
> > > > quite comfortable with our overall configuration, so my guess is that
> > > > that is not the issue (and I am QUITE happy with the indexing speed as
> > > > long as I use complete pages and not sentences).
> > >
> > > > Maybe there is a different way of attacking this? My goal is to be
> > > > able to execute a query and get the sentences that match the query in
> > > > the most efficient way while maintaining good/great indexing speed. I
> > > > would prefer not having to search the complete document for the
> > > > sentence in question.
> > >
> > > > My current solution is to have one Lucene Document for each page
> > > > (containing the URL and other information I require) that does NOT
> > > > contain the text of the page. Then I have one Lucene Document for each
> > > > sentence within that document, which contains the text of this
> > > > particular sentence in addition to some identifying information that
> > > > references the entry of the page itself.
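
For readers following along, the page/sentence layout described above might be sketched as below. This is plain Java with no Lucene calls; the class names, the field names (pageId, ordinal) and the example URL are all hypothetical, and the sentence splitter is a naive assumption (a sentence ends with '.', '!' or '?' followed by whitespace).

```java
import java.util.ArrayList;
import java.util.List;

class SentenceRecords {

    /** Page-level record: metadata only, deliberately without the page text. */
    static final class PageRecord {
        final int pageId;
        final String url;

        PageRecord(int pageId, String url) {
            this.pageId = pageId;
            this.url = url;
        }
    }

    /** Sentence-level record: the sentence text plus a reference back to its page. */
    static final class SentenceRecord {
        final int pageId;   // identifies the owning PageRecord
        final int ordinal;  // position of the sentence within the page
        final String text;

        SentenceRecord(int pageId, int ordinal, String text) {
            this.pageId = pageId;
            this.ordinal = ordinal;
            this.text = text;
        }
    }

    /** Splits the page text into one SentenceRecord per sentence. */
    static List<SentenceRecord> split(PageRecord page, String pageText) {
        List<SentenceRecord> out = new ArrayList<>();
        int ordinal = 0;
        for (String s : pageText.split("(?<=[.!?])\\s+")) {
            String trimmed = s.trim();
            if (!trimmed.isEmpty()) {
                out.add(new SentenceRecord(page.pageId, ordinal++, trimmed));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        PageRecord page = new PageRecord(7, "http://example.com/a"); // hypothetical URL
        for (SentenceRecord r : split(page, "I like the beach. It was sunny! Was it?")) {
            System.out.println(r.pageId + "/" + r.ordinal + ": " + r.text);
        }
    }
}
```

Each sentence record would become its own index entry, which is why the record count grows so much faster than the amount of text.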
> > >
> > > Any and all suggestions are welcome.
> > >
> > > Thanks!
> > > Jochen
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
