Yes, I thought of that too, but I didn't know whether I could search the index for only those documents that match on a specific field. After some research I found a way to do that:
    String q = "title:ant";
    Query query = parser.parse(q);

title:ant -> matches documents that contain the term "ant" in the title field

Regards,
Ivan


Erick Erickson wrote:
>
> Well, you have to add another field to each document identifying the PDF it
> came from. From there, restricting to that doc just becomes
> adding an AND clause. Of course how you specify these is "an
> exercise left to the reader" <G>.
>
> Erick
>
> On Fri, Oct 16, 2009 at 8:01 AM, IvanDrago <idrag...@gmail.com> wrote:
>
>>
>> Hey! I did it! Eric and Robert, you helped a lot. Thanks!
>>
>> I didn't use LucenePDFDocument. I created a new document for every page in a
>> PDF document and added page number info for every page.
>>
>> PDDocument pddDocument=PDDocument.load(f);
>> PDFTextStripper textStripper=new PDFTextStripper();
>>
>> IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);
>>
>> long start = new Date().getTime();
>>
>> // 350 pages just for test
>> for(int i=1; i<350; i++){
>>     //System.out.println("i= " + i);
>>     textStripper.setStartPage(i);
>>     textStripper.setEndPage(i);
>>
>>     //fetch one page
>>     pagecontent = textStripper.getText(pddDocument);
>>     System.out.println("pagecontent: " + pagecontent);
>>
>>     if (pagecontent != null){
>>         System.out.println("i= " + i);
>>         Document doc = new Document();
>>
>>         // Add the pagenumber
>>         doc.add(new Field("pagenumber", Integer.toString(i),
>>                 Field.Store.YES, Field.Index.ANALYZED));
>>         doc.add(new Field("content", pagecontent,
>>                 Field.Store.NO, Field.Index.ANALYZED));
>>
>>         iwriter.addDocument(doc);
>>     }
>>
>> }
>>
>> // Optimize and close the writer to finish building the index
>> iwriter.optimize();
>> iwriter.close();
>>
>> long end = new Date().getTime();
>>
>> System.out.println("Indexing files took "
>>     + (end - start) + " milliseconds");
>>
>> //just for test I searched for a string cryptography
>> String q = "cryptography";
>>
>> Directory fsDir = FSDirectory.getDirectory(index_dir, false);
>> IndexSearcher
>>     ind_searcher = new IndexSearcher(fsDir);
>>
>> // Build a Query object
>> QueryParser parser = new QueryParser("content", new StandardAnalyzer());
>> Query query = parser.parse(q);
>>
>> // Search for the query
>> Hits hits = ind_searcher.search(query);
>>
>> // Examine the Hits object to see if there were any matches
>> int hitCount = hits.length();
>> if (hitCount == 0) {
>>     System.out.println(
>>         "No matches were found for \"" + q + "\"");
>> }
>> else {
>>     System.out.println("Hits for \"" +
>>         q + "\" were found in pages:");
>>
>>     // Iterate over the Documents in the Hits object
>>     for (int i = 0; i < hitCount; i++) {
>>         Document doc = hits.doc(i);
>>
>>         // Print the value that we stored in the "title" field. Note
>>         // that this Field was not indexed, but (unlike the
>>         // "contents" field) was stored verbatim and can be
>>         // retrieved.
>>         //System.out.println(" " + (i + 1) + ". " + doc.get("title"));
>>         System.out.println(" " + (i + 1) + ". " + doc.get("pagenumber"));
>>     }
>> }
>> ind_searcher.close();
>>
>> --------------------
>> I'm using lucene version 2.9.0
>> You said that Hits are deprecated. Should I use HitCollector instead?
>>
>> Another question came into my mind... What if I want to add another PDF
>> document to the search pool? Before searching I would like to specify the PDF
>> document I want to search and then return the page number for the searched
>> String. I could create an index for every document that I add to the search pool
>> but that doesn't sound good to me. Can you think of a better way to do
>> that?
>>
>>
>> Erick Erickson wrote:
>> >
>> > Your search would be on the "contents" field if you use LucenePDFDocument.
>> >
>> > But on a quick look, LucenePDFDocument doesn't give you any page
>> > information. So, you'd have to collect that somehow, but I don't see a
>> > clear way to.
>> >
>> > Doing it manually, you could do something like:
>> >
>> > Document doc = new Document();
>> > for (each page in the document) {
>> >     doc.add("contents", <text for page>);
>> >     (record the offset of the last term in the page you just indexed);
>> > }
>> > doc.add("metadata", <string representation of the page offsets>);
>> > iw.addDocument(doc);
>> >
>> > Now, when you search you can get the offsets of the matching term,
>> > then look in your metadata field for the page number.
>> >
>> > Perhaps you could use the LucenePDFDocument in conjunction with this
>> > somehow, but I confess that I've never used it so it's not clear to me how
>> > you'd do this.
>> >
>> > Incidentally, the Hits object is deprecated, what version of Lucene are
>> > you intending to use?
>> >
>> > Best
>> > Erick
>> >
>> > On Thu, Oct 15, 2009 at 10:43 AM, IvanDrago <idrag...@gmail.com> wrote:
>> >
>> >>
>> >> Thanks for the reply Erick.
>> >>
>> >> I would like to permanently index this content and search it
>> >> multiple times so I would like a permanent copy and I want to search for
>> >> different terms multiple times.
>> >>
>> >> My problem is that I don't know how to retrieve a page number where the
>> >> searched string was found so
>> >> if you could help on that issue, that would be great.
>> >>
>> >> // I would start like this:
>> >> // This part of code would create the index, right?
>> >> Document luceneDocument = LucenePDFDocument.getDocument( f );
>> >> IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);
>> >> iwriter.addDocument(luceneDocument);
>> >> iwriter.close();
>> >>
>> >> //and now for the search:
>> >> Directory fsDir = FSDirectory.getDirectory(index_dir, false);
>> >> IndexSearcher ind_search = new IndexSearcher(fsDir);
>> >>
>> >> //I'm not sure if "fieldname" would be the string that I'm searching?
>> >> QueryParser parser = new QueryParser("fieldname", new StandardAnalyzer());
>> >> Query query = parser.parse(q);
>> >>
>> >> Hits hits = ind_search.search(query);
>> >>
>> >> //and I'm stuck here. Don't know how to retrieve the page number???
>> >>
>> >>
>> >> Erick Erickson wrote:
>> >> >
>> >> > It depends (tm). Do you want to permanently index this content and search it
>> >> > multiple times or is each search a one-off? If the latter, I'd look for
>> >> > packages specific to handling PDF files. Although, since Reader takes forever
>> >> > to search a document, I suspect there's not much joy there.
>> >> > If you want to parse the file once and search it many times, then yes,
>> >> > Lucene can help a lot. You could conceivably do this in a memory index if
>> >> > you didn't want a permanent copy. In this scheme, you'd index the file
>> >> > before the first search then use the in-memory index until you were done
>> >> > searching (assuming you wanted to search for different terms multiple
>> >> > times). You'd have to do some record-keeping to remember what the start and
>> >> > end offset of each page was so you could deal with the case that a phrase
>> >> > you search for started on one page and ended on another.....
>> >> >
>> >> > If this is off base, perhaps you could provide more details...
>> >> >
>> >> > Erick
>> >> >
>> >> > On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago <idrag...@gmail.com> wrote:
>> >> >
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> I have to search a single pdf document for a requested string and if that
>> >> >> string is found, I need to return the page number where that string was
>> >> >> found.
>> >> >> The requested string can be anything in the pdf document.
>> >> >>
>> >> >> It is a big document (about 5000 pages) so I'm asking if that is possible
>> >> >> with lucene.
>> >> >>
>> >> >> I'm using the pdfbox class and I found a way to do it (searching with instring
>> >> >> page by page) but it is too slow:
>> >> >>
>> >> >> PDDocument pddDocument=PDDocument.load(f);
>> >> >>
>> >> >> PDFTextStripper textStripper=new PDFTextStripper();
>> >> >> int lastpage = textStripper.getEndPage();
>> >> >> String page= null;
>> >> >> int found= 0;
>> >> >>
>> >> >> for(int i=1; i<lastpage ; i++){
>> >> >>     textStripper.setStartPage(i);
>> >> >>     textStripper.setEndPage(i);
>> >> >>
>> >> >>     page = textStripper.getText(pddDocument);
>> >> >>
>> >> >>     found = page.indexOf(searchtext);
>> >> >>
>> >> >>     // indexOf returns 0 for a match at the very start of the page
>> >> >>     if (found>=0) {returnpage= i; break;}
>> >> >> }
>> >> >> ----------------
>> >> >>
>> >> >> Is there a way to speed up the search with lucene? Can I use indexing to
>> >> >> solve this problem? thanks.
>> >> >>
>> >> >> --
>> >> >> View this message in context:
>> >> >> http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25905217.html
>> >> >> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> >> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>> >> >>
--
View this message in context: http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25925272.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
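
For anyone finding this thread later, here is a rough sketch of how the two ideas discussed in it might fit together: Erick's extra field identifying the source PDF, and the field:term query syntax from the top of this message. It only builds the query string that would then be handed to parser.parse(...). The class, the field name "pdfname" and the value "handbook" are made up for illustration, and the quoting helper is a simplification rather than a complete escaper for Lucene query syntax.

```java
public class PdfScopedQuery {

    // Wraps a value in double quotes so it is read as one quoted clause.
    // Simplification (assumption): only backslashes and double quotes
    // are escaped here.
    public static String quote(String value) {
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    // Joins two field-qualified clauses with AND, so a hit has to match
    // both the PDF identifier field and the page text field.
    public static String scopedQuery(String pdfField, String pdfId,
                                     String contentField, String text) {
        return pdfField + ":" + quote(pdfId)
                + " AND " + contentField + ":" + quote(text);
    }

    public static void main(String[] args) {
        // "pdfname" is a hypothetical extra field added to every page document.
        String q = scopedQuery("pdfname", "handbook", "content", "cryptography");
        System.out.println(q);
        // prints: pdfname:"handbook" AND content:"cryptography"
    }
}
```

The resulting string would go through the same QueryParser as before, and doc.get("pagenumber") would still give the page for each hit. Whether the identifier matches depends on how that field is indexed and analyzed, so a plain lowercase id with no punctuation is probably the safest starting point.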