Re: search trough single pdf document - return page number

Erick Erickson Thu, 15 Oct 2009 09:04:51 -0700

Your search would be on the "contents" field if you use LucenePDFDocument.


But on a quick look, LucenePDFDocument doesn't give you any page
information. So, you'd have to collect that somehow, but I don't see a clear
way to.

Doing it manually, you could do something like:

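Document doc = new Document();
for (each page in the document) {
  // one "contents" field instance per page, analyzed but not stored
  doc.add(new Field("contents", <text for page>,
                    Field.Store.NO, Field.Index.ANALYZED));
  <record the offset of the last term in the page you just indexed>;
}
// stored only, so the page offsets can be read back at search time
doc.add(new Field("metadata", <string representation of the page offsets>,
                  Field.Store.YES, Field.Index.NO));
iw.addDocument(doc);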

Now, when you search you can get the offsets of the matching term,
then look in your metadata field for the page number.
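
The bookkeeping half of that idea can be sketched in plain Java. This is only a minimal sketch under some assumptions: the offsets are character offsets into the page texts concatenated with no gap between pages, the class and method names (PageOffsets, pageOf, spansPages) are hypothetical, and getting the match offsets out of Lucene is not shown here at all.

```java
import java.util.List;

/** Hypothetical helper: maps character offsets to 1-based page numbers. */
final class PageOffsets {
    // pageEnds[i] is the exclusive end offset of page i+1 in the
    // concatenated text (assumption: no gap between pages).
    private final int[] pageEnds;

    private PageOffsets(int[] pageEnds) { this.pageEnds = pageEnds; }

    /** Builds the table while walking the pages in order. */
    static PageOffsets fromPages(List<String> pageTexts) {
        int[] ends = new int[pageTexts.size()];
        int offset = 0;
        for (int i = 0; i < ends.length; i++) {
            offset += pageTexts.get(i).length();
            ends[i] = offset;
        }
        return new PageOffsets(ends);
    }

    /** String form suitable for a stored "metadata" field, e.g. "10,15,24". */
    String toMetadata() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pageEnds.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(pageEnds[i]);
        }
        return sb.toString();
    }

    /** Parses the string produced by toMetadata(). */
    static PageOffsets fromMetadata(String metadata) {
        if (metadata.isEmpty()) return new PageOffsets(new int[0]);
        String[] parts = metadata.split(",");
        int[] ends = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            ends[i] = Integer.parseInt(parts[i]);
        }
        return new PageOffsets(ends);
    }

    /** Returns the 1-based page containing the offset, or -1 if out of range. */
    int pageOf(int offset) {
        int n = pageEnds.length;
        if (offset < 0 || n == 0 || offset >= pageEnds[n - 1]) return -1;
        // Binary search for the first page whose end lies beyond the offset;
        // empty pages (repeated end values) are skipped naturally.
        int lo = 0, hi = n - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (pageEnds[mid] > offset) hi = mid; else lo = mid + 1;
        }
        return lo + 1;
    }

    /** True if a match covering [start, end) starts and ends on different pages. */
    boolean spansPages(int start, int end) {
        return pageOf(start) != pageOf(end - 1);
    }
}
```

At search time you would read the stored metadata string back, rebuild the table with fromMetadata, and call pageOf with the start offset of the match. spansPages is there for the case of a phrase that starts on one page and ends on the next.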

Perhaps you could use the LucenePDFDocument in conjunction with this
somehow, but I confess that I've never used it so it's not clear to me how
you'd do this.

Incidentally, the Hits object is deprecated. What version of Lucene are
you intending to use?

Best
Erick

On Thu, Oct 15, 2009 at 10:43 AM, IvanDrago <[email protected]> wrote:

>
> Thanks for the reply Erick.
>
> I would like to permanently index this content and search it multiple
> times for different terms, so I would like a permanent copy.
>
> My problem is that I don't know how to retrieve the page number where
> the searched string was found, so if you could help with that issue,
> that would be great.
>
> // I would start like this:
> // This part of code would create the index, right?
> Document luceneDocument = LucenePDFDocument.getDocument( f );
> IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(),
> true);
> iwriter.addDocument(luceneDocument);
> iwriter.close();
>
> //and now for the search:
> Directory fsDir = FSDirectory.getDirectory(index_dir, false);
> IndexSearcher ind_search = new IndexSearcher(fsDir);
>
> // I'm not sure whether "fieldname" should be the string that I'm searching for?
> QueryParser parser = new QueryParser("fieldname", new StandardAnalyzer());
> Query query = parser.parse(q);
>
> Hits hits = ind_search.search(query);
>
> // and I'm stuck here. I don't know how to retrieve the page number.
>
>
>
>
>
>
>
> Erick Erickson wrote:
> >
> > It depends (tm). Do you want to permanently index this content and
> > search it multiple times, or is each search a one-off? If the latter,
> > I'd look for packages specific to handling PDF files. Although, since
> > Reader takes forever to search a document, I suspect there's not much
> > joy there.
> > If you want to parse the file once and search it many times, then yes,
> > Lucene can help a lot. You could conceivably do this in a memory index
> > if you didn't want a permanent copy. In this scheme, you'd index the
> > file before the first search, then use the in-memory index until you
> > were done searching (assuming you wanted to search for different terms
> > multiple times). You'd have to do some record-keeping to remember what
> > the start and end offset of each page was, so you could deal with the
> > case where a phrase you search for started on one page and ended on
> > another.
> >
> > If this is off base, perhaps you could provide more details...
> >
> > Erick
> >
> > On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago <[email protected]> wrote:
> >
> >>
> >> Hi,
> >>
> >> I have to search a single pdf document for a requested string, and if
> >> that string is found, I need to return the page number where it was
> >> found.
> >> The requested string can be anything in the pdf document.
> >>
> >> It is a big document (about 5000 pages), so I'm asking if that is
> >> possible with lucene.
> >>
> >> I'm using the pdfbox class and I found a way to do it (searching with
> >> instring page by page) but it is too slow:
> >>
> >>        PDDocument pddDocument = PDDocument.load(f);
> >>
> >>        PDFTextStripper textStripper = new PDFTextStripper();
> >>        // getEndPage() defaults to Integer.MAX_VALUE; use the real page count
> >>        int lastpage = pddDocument.getNumberOfPages();
> >>        String page = null;
> >>        int found = 0;
> >>
> >>        // pages are 1-based, so the last page must be included
> >>        for (int i = 1; i <= lastpage; i++) {
> >>            textStripper.setStartPage(i);
> >>            textStripper.setEndPage(i);
> >>
> >>            page = textStripper.getText(pddDocument);
> >>
> >>            found = page.indexOf(searchtext);
> >>
> >>            // indexOf returns -1 on a miss; 0 is a valid match position
> >>            if (found >= 0) { returnpage = i; break; }
> >>        }
> >> ----------------
> >>
> >> Is there a way to speed up the search with lucene? Can I use indexing to
> >> solve this problem? thanks.
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25905217.html
> >> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25909908.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>
