cool. i will use compression and store in index. is there anything special i need to for decompressing the text? i presume i can just do doc.get("content")? thanks for your advice all!
On Sat, Mar 7, 2009 at 11:50 AM, Uwe Schindler <u...@thetaphi.de> wrote: > You could store the text contents compressed; I think extracting text from > PDF files is much more time-intensive than decompressing a stored field. > And > text-only contents often compress very good. In my opinion, if the > (uncompressed) contents of the docs are not very large (so I mean several > megabytes each), I would prefer storing it in index. > > ----- > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -----Original Message----- > > From: Erik Hatcher [mailto:e...@ehatchersolutions.com] > > Sent: Saturday, March 07, 2009 12:46 PM > > To: java-user@lucene.apache.org > > Subject: Re: Lucene Highlighting and Dynamic Summaries > > > > It depends :) > > > > It's a trade-off. If storing is not prohibitive, I recommend that as > > it makes life easier for highlighting. > > > > Erik > > > > On Mar 7, 2009, at 6:37 AM, Amin Mohammed-Coleman wrote: > > > > > hi > > > that's what i was thinking about. i would need to get the file and > > > extract > > > the text again and then pass through the highlighter. The other > > > option is > > > storing the content in the index the downside being index is going > > > to be > > > large. Which would be the recommended approach? > > > > > > Cheers > > > > > > Amin > > > > > > On Sat, Mar 7, 2009 at 10:50 AM, Erik Hatcher > > <e...@ehatchersolutions.com > > > >wrote: > > > > > >> With the caveat that if you're not storing the text you want > > >> highlighted, > > >> you'll have to retrieve it somehow and send it into the Highlighter > > >> yourself. > > >> > > >> Erik > > >> > > >> > > >> On Mar 7, 2009, at 5:40 AM, Michael McCandless wrote: > > >> > > >> > > >>> You should look at contrib/highlighter, which does exactly this. > > >>> > > >>> Mike > > >>> > > >>> Amin Mohammed-Coleman wrote: > > >>> > > >>> Hi > > >>>> I am currently indexing documents (pdf, ms word, etc) that are > > >>>> uploaded, > > >>>> these documents can be searched and what the search returns to > > >>>> the user > > >>>> are > > >>>> summaries of the documents. Currently the summaries are > > >>>> extracted when > > >>>> indexing the file (summary constructed by taking the first 10 > > >>>> lines of > > >>>> the > > >>>> document and stored in the index as field). This is not ideal > > >>>> (static > > >>>> summary), and I was wondering if it would be possible to create a > > >>>> dynamic > > >>>> summary when a hit is found and highlight the terms found. The > > >>>> content > > >>>> of > > >>>> the document is not stored in the index. > > >>>> > > >>>> So basically what I'm looking to do is: > > >>>> > > >>>> 1) PDF indexed > > >>>> 2) PDF body contains the word "search" > > >>>> 3) Do a search and return the hit > > >>>> 4) Construct a summary with the term "search" included. > > >>>> > > >>>> I'm not sure how to go about doing this (I presume it is > > >>>> possible). I > > >>>> would > > >>>> be grateful for any advice. > > >>>> > > >>>> > > >>>> Cheers > > >>>> Amin > > >>>> > > >>> > > >>> > > >>> --------------------------------------------------------------------- > > >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > >>> For additional commands, e-mail: java-user-h...@lucene.apache.org > > >>> > > >> > > >> > > >> --------------------------------------------------------------------- > > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > >> For additional commands, e-mail: java-user-h...@lucene.apache.org > > >> > > >> > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >