Re: search through single pdf document - return page number

IvanDrago Fri, 16 Oct 2009 05:02:29 -0700

Hey! I did it! Eric and Robert, you helped a lot. Thanks!

I didn't use LucenePDFDocument. Instead, I created a separate Lucene document
for every page in the PDF document and added page number info to each one.


        PDDocument pddDocument = PDDocument.load(f);
        PDFTextStripper textStripper = new PDFTextStripper();

        IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);

        long start = new Date().getTime();

        // 350 pages just for test (pages are 1-based, so include page 350)
        for (int i = 1; i <= 350; i++) {
            // restrict the stripper to a single page
            textStripper.setStartPage(i);
            textStripper.setEndPage(i);

            // fetch one page
            String pagecontent = textStripper.getText(pddDocument);
            System.out.println("pagecontent: " + pagecontent);

            if (pagecontent != null) {
                System.out.println("i= " + i);

                // one Lucene document per PDF page
                Document doc = new Document();

                // Add the pagenumber (stored, so it can be read back from a hit)
                doc.add(new Field("pagenumber", Integer.toString(i),
                        Field.Store.YES, Field.Index.ANALYZED));
                // Page text is searchable but not stored
                doc.add(new Field("content", pagecontent,
                        Field.Store.NO, Field.Index.ANALYZED));

                iwriter.addDocument(doc);
            }
        }

        // Optimize and close the writer to finish building the index
        iwriter.optimize();
        iwriter.close();
        // release the PDF once all pages have been read
        pddDocument.close();

        long end = new Date().getTime();
        
        System.out.println("Indexing files took "
        + (end - start) + " milliseconds");

        //just for test I searched for a string cryptography
        String q = "cryptography";
        
        Directory fsDir = FSDirectory.getDirectory(index_dir, false);        
        IndexSearcher ind_searcher = new IndexSearcher(fsDir);
        
        // Build a Query object
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        Query query = parser.parse(q);

        // Search for the query
        Hits hits = ind_searcher.search(query);

        // Examine the Hits object to see if there were any matches
        int hitCount = hits.length();
        if (hitCount == 0) {
            System.out.println(
                "No matches were found for \"" + q + "\"");
        }
        else {
            System.out.println("Hits for \"" +
                q + "\" were found in pages:");

            // Iterate over the Documents in the Hits object
            for (int i = 0; i < hitCount; i++) {
                Document doc = hits.doc(i);

                // Print the value stored in the "pagenumber" field. Unlike the
                // "content" field (Field.Store.NO), it was stored verbatim
                // (Field.Store.YES) and can be retrieved from the hit.
                System.out.println("  " + (i + 1) + ". " + doc.get("pagenumber"));
            }
        }
        ind_searcher.close();
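
One thing to watch with the loop above: hits come back ordered by relevance
score, not by page, so the printed page numbers will generally not be
ascending. A minimal sketch of tidying that up in plain Java follows. It
assumes the stored "pagenumber" values have already been pulled out of the
hits into a list of strings; the names PageNumbers and sortedPages are made up
for illustration and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not a Lucene class.
final class PageNumbers {
    /** Parses stored page-number strings and returns them distinct and ascending. */
    static List<Integer> sortedPages(List<String> stored) {
        TreeSet<Integer> pages = new TreeSet<Integer>();
        for (String s : stored) {
            if (s == null) {
                continue; // the field was missing on this hit
            }
            pages.add(Integer.valueOf(s.trim()));
        }
        return new ArrayList<Integer>(pages);
    }
}
```

For example, passing the values "212", "7", "45", "7" gives back the list
[7, 45, 212].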

--------------------
I'm using Lucene version 2.9.0.
You said that Hits is deprecated. Should I use HitCollector instead?

Another question came to mind: what if I want to add another PDF
document to the search pool? Before searching, I would like to specify which
PDF document to search and then get back the page number for the searched
string. I could create a separate index for every document that I add to the
search pool, but that doesn't sound good to me. Can you think of a better way
to do that?
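
One possible shape for this, sketched in plain Java rather than Lucene: keep a
single pool and tag every page entry with the name of the PDF it came from,
then restrict the search to one name. The PagePool class below is a made-up
stand-in for the index, and its tokenizing is far cruder than
StandardAnalyzer. In Lucene terms this would presumably correspond to one
extra field per page document holding the file name, with the query
restricted on that field; that mapping is an assumption and has not been
tested against 2.9.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a single shared index.
final class PagePool {
    // term -> (pdf name -> pages of that pdf containing the term)
    private final Map<String, Map<String, TreeSet<Integer>>> index =
            new HashMap<String, Map<String, TreeSet<Integer>>>();

    /** Adds one page of one PDF to the shared pool. */
    void addPage(String pdfName, int pageNumber, String pageText) {
        for (String term : pageText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            Map<String, TreeSet<Integer>> byPdf = index.get(term);
            if (byPdf == null) {
                byPdf = new HashMap<String, TreeSet<Integer>>();
                index.put(term, byPdf);
            }
            TreeSet<Integer> pages = byPdf.get(pdfName);
            if (pages == null) {
                pages = new TreeSet<Integer>();
                byPdf.put(pdfName, pages);
            }
            pages.add(pageNumber);
        }
    }

    /** Returns the ascending page numbers of pdfName that contain term. */
    List<Integer> search(String pdfName, String term) {
        Map<String, TreeSet<Integer>> byPdf = index.get(term.toLowerCase(Locale.ROOT));
        if (byPdf == null || !byPdf.containsKey(pdfName)) {
            return Collections.emptyList();
        }
        return new ArrayList<Integer>(byPdf.get(pdfName));
    }
}
```

The point of the design is that adding another PDF only adds entries to the
same pool; choosing which PDF to search becomes a filter at query time rather
than a choice between separate indexes.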


Erick Erickson wrote:
> 
> Your search would be on the "contents" field if you use LucenePDFDocument.
> 
> But on a quick look, LucenePDFDocument doesn't give you any page
> information. So, you'd have to collect that somehow, but I don't see a
> clear way to.
> 
> Doing it manually, you could do something like:
> 
> Document doc = new Document();
> for (each page in the document) {
>   doc.add("contents", <text for page>);
>   <record the offset of the last term in the page you just indexed>;
> }
> doc.add("metadata", <string representation of the page offsets>);
> iw.addDocument(doc);
> 
> Now, when you search you can get the offsets of the matching term,
> then look in your metadata field for the page number.
> 
> Perhaps you could use the LucenePDFDocument in conjunction with this
> somehow, but I confess that I've never used it so it's not clear to me how
> you'd do this.
> 
> Incidentally, the Hits object is deprecated. What version of Lucene are
> you intending to use?
> 
> Best
> Erick
> 
> On Thu, Oct 15, 2009 at 10:43 AM, IvanDrago <[email protected]> wrote:
> 
>>
>> Thanks for the reply Erick.
>>
>> I would like to permanently index this content and search it multiple
>> times for different terms, so I would like a permanent copy.
>>
>> My problem is that I don't know how to retrieve the page number where the
>> searched string was found, so if you could help with that issue, that
>> would be great.
>>
>> // I would start like this:
>> // This part of code would create the index, right?
>> Document luceneDocument = LucenePDFDocument.getDocument( f );
>> IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);
>> iwriter.addDocument(luceneDocument);
>> iwriter.close();
>>
>> //and now for the search:
>> Directory fsDir = FSDirectory.getDirectory(index_dir, false);
>> IndexSearcher ind_search = new IndexSearcher(fsDir);
>>
>> //im not sure if "fieldname" would be the string that I'm searching?
>> QueryParser parser = new QueryParser("fieldname", new StandardAnalyzer());
>> Query query = parser.parse(q);
>>
>> Hits hits = ind_search.search(query);
>>
>> //and I'm stuck here. Don't know how to retrieve the page number???
>>
>> Erick Erickson wrote:
>> >
>> > It depends (tm). Do you want to permanently index this content and
>> > search it multiple times, or is each search a one-off? If the latter,
>> > I'd look for packages specific to handling PDF files, although since
>> > Reader takes forever to search a document, I suspect there's not much
>> > joy there.
>> > If you want to parse the file once and search it many times, then yes,
>> > Lucene can help a lot. You could conceivably do this in a memory index
>> > if you didn't want a permanent copy. In this scheme, you'd index the
>> > file before the first search, then use the in-memory index until you
>> > were done searching (assuming you wanted to search for different terms
>> > multiple times). You'd have to do some record-keeping to remember what
>> > the start and end offset of each page was so you could deal with the
>> > case that a phrase you search for started on one page and ended on
>> > another.
>> >
>> > If this is off base, perhaps you could provide more details...
>> >
>> > Erick
>> >
>> > On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago <[email protected]> wrote:
>> >
>> >>
>> >> Hi,
>> >>
>> >> I have to search a single pdf document for a requested string, and
>> >> if that string is found, I need to return the page number where it
>> >> was found. The requested string can be anything in the pdf document.
>> >>
>> >> It is a big document (about 5000 pages), so I'm asking if that is
>> >> possible with lucene.
>> >>
>> >> I'm using the pdfbox class and I found a way to do it (a substring
>> >> search, page by page), but it is too slow:
>> >>
>> >>        PDDocument pddDocument = PDDocument.load(f);
>> >>
>> >>        PDFTextStripper textStripper = new PDFTextStripper();
>> >>        // take the page count from the document itself
>> >>        int lastpage = pddDocument.getNumberOfPages();
>> >>        String page = null;
>> >>        int found = 0;
>> >>
>> >>        // pages are 1-based, so include the last page
>> >>        for (int i = 1; i <= lastpage; i++) {
>> >>            textStripper.setStartPage(i);
>> >>            textStripper.setEndPage(i);
>> >>
>> >>            page = textStripper.getText(pddDocument);
>> >>
>> >>            // indexOf returns -1 when absent; 0 is a match at the start
>> >>            found = page.indexOf(searchtext);
>> >>
>> >>            if (found >= 0) { returnpage = i; break; }
>> >>        }
>> >> ----------------
>> >>
>> >> Is there a way to speed up the search with lucene? Can I use indexing
>> >> to solve this problem? Thanks.
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25905217.html
>> >> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>
>> >>
>> >
>> >
>>
> 
> 
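
The page-offset bookkeeping suggested in the quoted replies (remember where
each page ends, then map a match offset back to a page) can be sketched in
plain Java. This is only an illustration under stated assumptions: it uses
character offsets in a concatenated string as a stand-in for Lucene term
offsets, it joins pages with no separator, and the PageOffsets class and its
methods are hypothetical names, not Lucene or PDFBox API.

```java
import java.util.List;

// Hypothetical helper: maps offsets in concatenated page text back to pages.
final class PageOffsets {
    private final String text;     // all pages concatenated, in order
    private final int[] pageEnds;  // pageEnds[i] = exclusive end offset of page i+1

    PageOffsets(List<String> pages) {
        StringBuilder sb = new StringBuilder();
        pageEnds = new int[pages.size()];
        for (int i = 0; i < pages.size(); i++) {
            sb.append(pages.get(i));
            pageEnds[i] = sb.length();
        }
        text = sb.toString();
    }

    /** Returns the 1-based page that holds the character at offset. */
    int pageForOffset(int offset) {
        if (offset < 0 || offset >= text.length()) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        // binary search for the first page whose end lies beyond offset
        int lo = 0;
        int hi = pageEnds.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (pageEnds[mid] > offset) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return lo + 1;
    }

    /** Returns {firstPage, lastPage} of the first match, or null if absent. */
    int[] findPages(String phrase) {
        if (phrase.isEmpty()) {
            return null;
        }
        int start = text.indexOf(phrase);
        if (start < 0) {
            return null;
        }
        int end = start + phrase.length() - 1;
        return new int[] { pageForOffset(start), pageForOffset(end) };
    }
}
```

With the two pages "hello wor" and "ld again", findPages("world") returns
{1, 2}, which is the case of a phrase starting on one page and ending on the
next. Real extracted text usually has whitespace or hyphenation at a page
break, so a practical version would need to normalize that first.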

-- 
View this message in context: 
http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25924250.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

