Re: search trough single pdf document - return page number
Yes, I thought of that too, but I didn't know whether I could search the index for only those documents that have a specific field. After some research I found a way to do that:

    String q = "title:ant";
    Query query = parser.parse(q);

title:ant -> matches documents that contain the term "ant" in the title field.

Regards,
Ivan
Re: search trough single pdf document - return page number
Well, you have to add another field to each document identifying the PDF it came from. From there, restricting to that doc just becomes adding an AND clause. Of course, how you specify these is "an exercise left to the reader".

Erick
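A minimal sketch of the AND-clause idea, in plain Java. It only builds a query string in Lucene's classic query-parser syntax and does not call Lucene itself. The field name pdfname and the class and method names are hypothetical; the thread does not name them. It also assumes the identifying field is indexed in a form that matches how the query parser analyzes the quoted value.

```java
final class PdfScopedQuery {
    // Hypothetical field name: the thread only says "another field identifying the PDF".
    static final String PDF_FIELD = "pdfname";

    /** Wraps the user's query so that it only matches pages from one PDF. */
    static String restrictToPdf(String userQuery, String pdfName) {
        // Escape backslashes and double quotes so the name is safe inside a quoted term.
        String escaped = pdfName.replace("\\", "\\\\").replace("\"", "\\\"");
        return PDF_FIELD + ":\"" + escaped + "\" AND (" + userQuery + ")";
    }

    public static void main(String[] args) {
        // The resulting string would be handed to parser.parse(...) as in the thread.
        System.out.println(restrictToPdf("cryptography", "handbook.pdf"));
    }
}
```

Every per-page document would need the same field added at indexing time, next to the pagenumber and content fields, for this restriction to match anything.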
Re: search trough single pdf document - return page number
Proximity queries that span pages are not a concern in my case. I asked another question at the bottom of my last post. Could you comment on that if you have any ideas?
Re: search trough single pdf document - return page number
Glad things are progressing. The only problem here will be proximity queries that span pages. Say the last word on page 10 is "salmon" and the first word on page 11 is "fishing". With the index structured this way, a proximity search for "salmon fishing" won't find that match.

If that's not a concern, then there's no reason to complexify the situation.

FWIW
Erick
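The page-boundary limitation can be illustrated with a toy example in plain Java. Plain substring matching stands in for a Lucene phrase or proximity query here, so this is an analogy rather than Lucene behavior; the class and method names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class PageBoundaryDemo {
    /** Returns the 1-based numbers of pages whose own text contains the phrase. */
    static List<Integer> pagesContaining(String[] pages, String phrase) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < pages.length; i++) {
            if (pages[i].contains(phrase)) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] pages = { "we went looking for salmon", "fishing gear was packed" };
        // Searching page by page: neither page holds the whole phrase.
        System.out.println(pagesContaining(pages, "salmon fishing"));
        // Searching the joined text: the phrase is there, across the boundary.
        System.out.println(String.join(" ", pages).contains("salmon fishing"));
    }
}
```

Single-term searches are unaffected, which is why the one-document-per-page layout is fine when cross-page phrases do not matter.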
Re: search trough single pdf document - return page number
Hey! I did it! Erick and Robert, you helped a lot. Thanks!

I didn't use LucenePDFDocument. I created a new document for every page in a PDF document and added page number info for every page.

    PDDocument pddDocument = PDDocument.load(f);
    PDFTextStripper textStripper = new PDFTextStripper();

    IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);

    long start = new Date().getTime();

    // 350 pages just for test
    for (int i = 1; i <= 350; i++) {
        textStripper.setStartPage(i);
        textStripper.setEndPage(i);

        // fetch one page
        String pagecontent = textStripper.getText(pddDocument);
        System.out.println("pagecontent: " + pagecontent);

        if (pagecontent != null) {
            System.out.println("i= " + i);
            Document doc = new Document();

            // Add the page number (stored, so it can be read back from a hit)
            doc.add(new Field("pagenumber", Integer.toString(i),
                    Field.Store.YES, Field.Index.ANALYZED));
            // Add the page text (indexed for searching, not stored)
            doc.add(new Field("content", pagecontent,
                    Field.Store.NO, Field.Index.ANALYZED));

            iwriter.addDocument(doc);
        }
    }

    // Optimize and close the writer to finish building the index
    iwriter.optimize();
    iwriter.close();

    long end = new Date().getTime();

    System.out.println("Indexing files took " + (end - start) + " milliseconds");

    // just for test I searched for the string "cryptography"
    String q = "cryptography";

    Directory fsDir = FSDirectory.getDirectory(index_dir, false);
    IndexSearcher ind_searcher = new IndexSearcher(fsDir);

    // Build a Query object
    QueryParser parser = new QueryParser("content", new StandardAnalyzer());
    Query query = parser.parse(q);

    // Search for the query
    Hits hits = ind_searcher.search(query);

    // Examine the Hits object to see if there were any matches
    int hitCount = hits.length();
    if (hitCount == 0) {
        System.out.println("No matches were found for \"" + q + "\"");
    } else {
        System.out.println("Hits for \"" + q + "\" were found in pages:");

        // Iterate over the Documents in the Hits object
        for (int i = 0; i < hitCount; i++) {
            Document doc = hits.doc(i);

            // Print the value stored in the "pagenumber" field. Unlike the
            // "content" field, it was stored verbatim and can be retrieved.
            System.out.println(" " + (i + 1) + ". " + doc.get("pagenumber"));
        }
    }
    ind_searcher.close();

I'm using Lucene version 2.9.0. You said that Hits is deprecated. Should I use HitCollector instead?

Another question came to mind... What if I want to add another PDF document to the search pool? Before searching, I would like to specify which PDF document to search and then get back the page number for the searched string. I could create a separate index for every document that I add to the search pool, but that doesn't sound good to me. Can you think of a better way to do that?
Re: search trough single pdf document - return page number
Your search would be on the "contents" field if you use LucenePDFDocument.

But on a quick look, LucenePDFDocument doesn't give you any page information. So you'd have to collect that somehow, but I don't see a clear way to.

Doing it manually, you could do something like:

    Document doc = new Document();
    for (each page in the document) {
        doc.add("contents", ...);
        // record the offset of the last term in the page you just indexed
    }
    doc.add("metadata", ...);
    iw.addDocument(doc);

Now, when you search, you can get the offsets of the matching term, then look in your metadata field for the page number.

Perhaps you could use the LucenePDFDocument in conjunction with this somehow, but I confess that I've never used it, so it's not clear to me how you'd do this.

Incidentally, the Hits object is deprecated. What version of Lucene are you intending to use?

Best
Erick
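The offset bookkeeping described above might look like the following sketch in plain Java. It is not Lucene code: the comma-separated metadata format, the use of character offsets (rather than term positions), and the class and method names are all assumptions made for illustration.

```java
final class PageOffsetMetadata {
    /** Joins the exclusive end offset of each page into one string, e.g. "100,150,350". */
    static String encode(int[] pageEndOffsets) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pageEndOffsets.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(pageEndOffsets[i]);
        }
        return sb.toString();
    }

    /** Returns the 1-based page that contains the offset, or -1 if it is out of range. */
    static int pageForOffset(String metadata, int offset) {
        if (offset < 0 || metadata.isEmpty()) {
            return -1;
        }
        String[] ends = metadata.split(",");
        for (int i = 0; i < ends.length; i++) {
            // The first page whose end lies beyond the offset is the one that holds it.
            if (offset < Integer.parseInt(ends[i])) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

The encoded string is what would go into the stored "metadata" field; a linear scan is used because it stays correct even when a page is empty and two end offsets coincide.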
Re: search trough single pdf document - return page number
If you just have a single PDF document (from the subject line, this seems to be the case) and you want to retrieve pages, maybe consider splitting the PDF into single pages. There is some functionality in PDFBox to do this.

Then index each page as a single Lucene document (so you will have 5000 Lucene documents, one for each page). This way you could do a search and return page numbers easily.

--
Robert Muir
rcm...@gmail.com
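To show why one document per page makes page numbers easy to return, here is a toy inverted index in plain Java. It is a stand-in for what Lucene does, not Lucene's API: the class name, the lowercase-and-split tokenization, and the method names are assumptions for the sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class PageIndex {
    // term -> sorted set of 1-based page numbers that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Indexes one page: every token on it points back to the page number. */
    public void addPage(int pageNumber, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(pageNumber);
            }
        }
    }

    /** Returns the pages that contain the term, in ascending order. */
    public SortedSet<Integer> pagesFor(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySortedSet());
    }
}
```

Indexing costs one pass over the text; after that, each lookup is a map access rather than a rescan of all 5000 pages, which is the speed-up the original page-by-page search was missing.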
Re: search trough single pdf document - return page number
Thanks for the reply, Erick.

I would like to index this content permanently and search it multiple times for different terms, so I do want a permanent copy of the index. My problem is that I don't know how to retrieve the page number where the searched string was found, so if you could help with that, that would be great.

I would start like this:

    // This part of the code would create the index, right?
    Document luceneDocument = LucenePDFDocument.getDocument(f);
    IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);
    iwriter.addDocument(luceneDocument);
    iwriter.close();

    // And now for the search:
    Directory fsDir = FSDirectory.getDirectory(index_dir, false);
    IndexSearcher ind_search = new IndexSearcher(fsDir);

    // I'm not sure whether "fieldname" should be the string that I'm searching for?
    QueryParser parser = new QueryParser("fieldname", new StandardAnalyzer());
    Query query = parser.parse(q);
    Hits hits = ind_search.search(query);

    // And I'm stuck here. I don't know how to retrieve the page number.

--
View this message in context: http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25909908.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
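One way to see why the page number is hard to get back from a single document for the whole PDF: the index only knows which document matched, so the page has to be recorded per indexed unit. The approach reported later in this thread does that by adding one Lucene document per page with a stored page-number field. The sketch below shows the same idea with plain Java collections only. It is a stand-in for illustration, not Lucene's API; the names PageIndex, addPage and pagesContaining are made up here, and the tokenization is a crude assumption rather than what StandardAnalyzer does.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for a per-page index: each term maps to the set of
// page numbers it occurs on, so a lookup returns pages directly.
final class PageIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Index one page. Tokens are lowercased runs of letters and digits
    // (an assumption made for this sketch).
    public void addPage(int pageNumber, String pageText) {
        String[] tokens = pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(pageNumber);
            }
        }
    }

    // Return the pages containing a single term, in ascending order.
    public SortedSet<Integer> pagesContaining(String term) {
        SortedSet<Integer> pages = postings.get(term.toLowerCase(Locale.ROOT));
        return pages == null
                ? Collections.emptySortedSet()
                : Collections.unmodifiableSortedSet(pages);
    }

    public static void main(String[] args) {
        PageIndex index = new PageIndex();
        index.addPage(1, "Public-key cryptography basics");
        index.addPage(2, "Hash functions");
        index.addPage(3, "Cryptography, again.");
        System.out.println(index.pagesContaining("cryptography"));
    }
}
```

The index is built once and each lookup is a map access, which is the property that makes indexing attractive compared with re-extracting and scanning every page per search. Phrase queries and persistence are left out of this sketch.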
Re: search trough single pdf document - return page number
It depends (tm). Do you want to permanently index this content and search it multiple times, or is each search a one-off? If the latter, I'd look for packages specific to handling PDF files, although since Reader takes forever to search a document, I suspect there's not much joy there.

If you want to parse the file once and search it many times, then yes, Lucene can help a lot. You could conceivably do this in a memory index if you didn't want a permanent copy. In this scheme, you'd index the file before the first search, then use the in-memory index until you were done searching (assuming you wanted to search for different terms multiple times). You'd have to do some record-keeping to remember the start and end offsets of each page so you could deal with the case where a phrase you search for starts on one page and ends on another.

If this is off base, perhaps you could provide more details...

Erick

On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago wrote:

> Hi,
>
> I have to search a single PDF document for a requested string and, if that string is found, return the page number where it was found. The requested string can be anything in the PDF document.
>
> It is a big document (about 5000 pages), so I'm asking if that is possible with Lucene.
>
> I'm using the pdfbox class and I found a way to do it (searching with indexOf page by page), but it is too slow:
>
>     PDDocument pddDocument = PDDocument.load(f);
>     PDFTextStripper textStripper = new PDFTextStripper();
>     int lastpage = textStripper.getEndPage();
>     String page = null;
>     int found = 0;
>
>     for (int i = 1; i < lastpage; i++) {
>         textStripper.setStartPage(i);
>         textStripper.setEndPage(i);
>
>         page = textStripper.getText(pddDocument);
>
>         found = page.indexOf(searchtext);
>
>         // indexOf returns -1 for no match; 0 is a match at the start of the page
>         if (found >= 0) { returnpage = i; break; }
>     }
>
> Is there a way to speed up the search with Lucene? Can I use indexing to solve this problem? Thanks.
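The record-keeping suggested above (remembering where each page starts so a match can be mapped back to a page, including a phrase that runs across a page break) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class and method names (PageOffsets, addPage, pageOf, find) are made up here, the page texts are assumed to be extracted already, and a simple substring search stands in for whatever index is actually used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: concatenates page texts and remembers the character
// offset at which each page starts.
final class PageOffsets {
    private final StringBuilder text = new StringBuilder();
    private final List<Integer> starts = new ArrayList<>();

    // Append the next page; pages are numbered from 1 in insertion order.
    public void addPage(String pageText) {
        starts.add(text.length());
        text.append(pageText);
    }

    // Binary search for the last page whose start offset is <= offset.
    public int pageOf(int offset) {
        int lo = 0;
        int hi = starts.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (starts.get(mid) <= offset) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo + 1;
    }

    // Return {firstPage, lastPage} of the first occurrence, or null if absent.
    public int[] find(String phrase) {
        if (phrase.isEmpty()) {
            return null;
        }
        int at = text.indexOf(phrase);
        if (at < 0) {
            return null;
        }
        return new int[] { pageOf(at), pageOf(at + phrase.length() - 1) };
    }

    public static void main(String[] args) {
        PageOffsets pages = new PageOffsets();
        pages.addPage("alpha beta ");
        pages.addPage("gamma delta ");
        System.out.println(Arrays.toString(pages.find("beta gamma")));
    }
}
```

A phrase is only found across a page break if the extracted page texts join the way the phrase is written, so in practice some whitespace normalization at page boundaries would probably be needed.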