Hey! I did it! Erick and Robert, you helped a lot. Thanks! I didn't use LucenePDFDocument. Instead, I created a new Lucene document for every page in the PDF document and added the page number as a field on each one.

PDDocument pddDocument = PDDocument.load(f);
PDFTextStripper textStripper = new PDFTextStripper();

IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);

long start = new Date().getTime();
String pagecontent;

// 350 pages just for test
for (int i = 1; i < 350; i++) {
    // fetch one page
    textStripper.setStartPage(i);
    textStripper.setEndPage(i);
    pagecontent = textStripper.getText(pddDocument);
    System.out.println("pagecontent: " + pagecontent);

    if (pagecontent != null) {
        System.out.println("i= " + i);
        Document doc = new Document();

        // Add the page number (stored, so it can be read back from a hit)
        doc.add(new Field("pagenumber", Integer.toString(i), Field.Store.YES, Field.Index.ANALYZED));
        // Add the page text (indexed for searching, but not stored)
        doc.add(new Field("content", pagecontent, Field.Store.NO, Field.Index.ANALYZED));
        iwriter.addDocument(doc);
    }
}
// Release the PDF once all pages have been extracted
pddDocument.close();

// Optimize and close the writer to finish building the index
iwriter.optimize();
iwriter.close();

long end = new Date().getTime();
System.out.println("Indexing files took " + (end - start) + " milliseconds");

// just for test I searched for the string "cryptography"
String q = "cryptography";

Directory fsDir = FSDirectory.getDirectory(index_dir, false);
IndexSearcher ind_searcher = new IndexSearcher(fsDir);

// Build a Query object
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse(q);

// Search for the query
Hits hits = ind_searcher.search(query);

// Examine the Hits object to see if there were any matches
int hitCount = hits.length();
if (hitCount == 0) {
    System.out.println("No matches were found for \"" + q + "\"");
} else {
    System.out.println("Hits for \"" + q + "\" were found in pages:");

    // Iterate over the Documents in the Hits object
    for (int i = 0; i < hitCount; i++) {
        Document doc = hits.doc(i);

        // Print the value stored in the "pagenumber" field. Unlike the
        // "content" field, it was stored verbatim and can be retrieved.
        System.out.println(" " + (i + 1) + ". " + doc.get("pagenumber"));
    }
}
ind_searcher.close();

--------------------

I'm using Lucene version 2.9.0. You said that Hits is deprecated. Should I use HitCollector instead?

Another question came to mind: what if I want to add another PDF document to the search pool? Before searching, I would like to specify which PDF document to search and then get back the page number for the searched string. I could create a separate index for every document that I add to the search pool, but that doesn't sound good to me. Can you think of a better way to do that?


Erick Erickson wrote:
>
> Your search would be on the "contents" field if you use LucenePDFDocument.
>
> But on a quick look, LucenePDFDocument doesn't give you any page
> information. So, you'd have to collect that somehow, but I don't see a
> clear way to.
>
> Doing it manually, you could do something like:
>
> Document doc = new Document();
> for (each page in the document) {
>     doc.add("contents", <text for page>);
>     <record the offset of the last term in the page you just indexed>;
> }
> doc.add("metadata", <string representation of the page offsets>);
> iw.addDocument(doc);
>
> Now, when you search you can get the offsets of the matching term,
> then look in your metadata field for the page number.
>
> Perhaps you could use the LucenePDFDocument in conjunction with this
> somehow, but I confess that I've never used it so it's not clear to me how
> you'd do this.
>
> Incidentally, the Hits object is deprecated. What version of Lucene are
> you intending to use?
>
> Best
> Erick
>
> On Thu, Oct 15, 2009 at 10:43 AM, IvanDrago <idrag...@gmail.com> wrote:
>
>>
>> Thanks for the reply Erick.
>>
>> I would like to permanently index this content and search it multiple
>> times, so I would like a permanent copy and I want to search for
>> different terms multiple times.
>>
>> My problem is that I don't know how to retrieve the page number where
>> the searched string was found, so if you could help with that issue,
>> that would be great.
>>
>> // I would start like this:
>> // This part of the code would create the index, right?
>> Document luceneDocument = LucenePDFDocument.getDocument( f );
>> IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);
>> iwriter.addDocument(luceneDocument);
>> iwriter.close();
>>
>> // and now for the search:
>> Directory fsDir = FSDirectory.getDirectory(index_dir, false);
>> IndexSearcher ind_search = new IndexSearcher(fsDir);
>>
>> // I'm not sure if "fieldname" would be the string that I'm searching?
>> QueryParser parser = new QueryParser("fieldname", new StandardAnalyzer());
>> Query query = parser.parse(q);
>>
>> Hits hits = ind_search.search(query);
>>
>> // and I'm stuck here. Don't know how to retrieve the page number???
>>
>>
>> Erick Erickson wrote:
>> >
>> > It depends (tm). Do you want to permanently index this content and
>> > search it multiple times, or is each search a one-off? If the latter,
>> > I'd look for packages specific to handling PDF files. Although, since
>> > Reader takes forever to search a document, I suspect there's not much
>> > joy there.
>> >
>> > If you want to parse the file once and search it many times, then yes,
>> > Lucene can help a lot. You could conceivably do this in a memory index
>> > if you didn't want a permanent copy. In this scheme, you'd index the
>> > file before the first search, then use the in-memory index until you
>> > were done searching (assuming you wanted to search for different terms
>> > multiple times). You'd have to do some record-keeping to remember what
>> > the start and end offset of each page was, so you could deal with the
>> > case where a phrase you search for starts on one page and ends on
>> > another.....
>> >
>> > If this is off base, perhaps you could provide more details...
>> >
>> > Erick
>> >
>> > On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago <idrag...@gmail.com> wrote:
>> >
>> >>
>> >> Hi,
>> >>
>> >> I have to search a single PDF document for a requested string and, if
>> >> that string is found, I need to return the page number where it was
>> >> found. The requested string can be anything in the PDF document.
>> >>
>> >> It is a big document (about 5000 pages), so I'm asking if that is
>> >> possible with Lucene.
>> >>
>> >> I'm using the PDFBox classes and I found a way to do it (searching
>> >> with instring page by page), but it is too slow:
>> >>
>> >> PDDocument pddDocument=PDDocument.load(f);
>> >>
>> >> PDFTextStripper textStripper=new PDFTextStripper();
>> >> int lastpage = textStripper.getEndPage();
>> >> String page= null;
>> >> int found= 0;
>> >>
>> >> for(int i=1; i<lastpage ; i++){
>> >>     textStripper.setStartPage(i);
>> >>     textStripper.setEndPage(i);
>> >>
>> >>     page = textStripper.getText(pddDocument);
>> >>
>> >>     found = page .indexOf(searchtext);
>> >>
>> >>     if (found>0) {returnpage= i; break;}
>> >> }
>> >> ----------------
>> >>
>> >> Is there a way to speed up the search with Lucene? Can I use indexing
>> >> to solve this problem? Thanks.
>> >>
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25905217.html
>> >> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>> >>
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25909908.html
>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>>
>

--
View this message in context: http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25924250.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
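
The record-keeping Erick describes in the quoted replies (remember where each page starts, store that as a "metadata" string, then map a hit offset back to a page) can be sketched without any Lucene calls. This is only a rough illustration under some assumptions: the class name PageOffsets and its methods are made up for this example and are not part of Lucene or PDFBox, offsets are plain character offsets into the concatenated page texts, and how the offset of a hit is obtained from Lucene is left out entirely.

```java
import java.util.List;

// Hypothetical helper (not a Lucene or PDFBox class): keeps page boundaries
// as character offsets into the concatenated text of all pages.
final class PageOffsets {
    // bounds[k] is the offset where page k+1 starts; the last entry is the total length.
    private final int[] bounds;

    private PageOffsets(int[] bounds) {
        this.bounds = bounds;
    }

    // Build the boundaries from the extracted text of each page, in page order.
    static PageOffsets fromPages(List<String> pages) {
        int[] b = new int[pages.size() + 1];
        for (int k = 0; k < pages.size(); k++) {
            b[k + 1] = b[k] + pages.get(k).length();
        }
        return new PageOffsets(b);
    }

    // String form that could be kept in a stored "metadata" field.
    String toMetadata() {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < bounds.length; k++) {
            if (k > 0) {
                sb.append(',');
            }
            sb.append(bounds[k]);
        }
        return sb.toString();
    }

    // Rebuild the boundaries from the string produced by toMetadata().
    static PageOffsets fromMetadata(String metadata) {
        String[] parts = metadata.split(",");
        int[] b = new int[parts.length];
        for (int k = 0; k < parts.length; k++) {
            b[k] = Integer.parseInt(parts[k].trim());
        }
        return new PageOffsets(b);
    }

    // Returns the 1-based page containing charOffset, or -1 if it is out of range.
    int pageFor(int charOffset) {
        int total = bounds[bounds.length - 1];
        if (charOffset < 0 || charOffset >= total) {
            return -1;
        }
        // Binary search for the last page whose start offset is <= charOffset.
        int lo = 0;
        int hi = bounds.length - 2;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (bounds[mid] <= charOffset) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo + 1;
    }
}
```

For example, three pages whose texts have lengths 5, 0 and 7 serialize to "0,5,5,12", and offset 5 maps to page 3 because page 2 is empty. A phrase that crosses a page break shows up as pageFor(startOffset) and pageFor(endOffset) returning different pages.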