Re: search through single pdf document - return page number

2009-10-16 Thread IvanDrago

Yes, I thought of that too, but I didn't know whether I could restrict a
search to only those documents in the index that have a specific field value.
After some research I found a way to do that:

String q = "title:ant";
Query query = parser.parse(q);

title:ant -> matches documents that contain the term "ant" in the title field

Regards,
Ivan
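
A minimal sketch of how such a field-restricted query string could be put
together before it is passed to parser.parse(). The field names "pdfname" and
"content" and the FieldQueries helper are assumptions made up for this
illustration; only the "+" prefix (required clause), the quoted phrase and the
field:(...) grouping are query parser syntax.

```java
// Hypothetical helper (not part of Lucene): builds a query string that
// restricts a search to the pages of one PDF.
final class FieldQueries {
    private FieldQueries() {}

    /** Wraps a value in double quotes, escaping backslashes and quotes inside it. */
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /**
     * Both clauses carry '+', so a hit has to match the PDF name AND the
     * user's query. The user's query is passed through unchanged.
     */
    static String restrictToPdf(String pdfName, String userQuery) {
        return "+pdfname:" + quote(pdfName) + " +content:(" + userQuery + ")";
    }
}
```

FieldQueries.restrictToPdf("manual.pdf", "cryptography") gives
+pdfname:"manual.pdf" +content:(cryptography). Whether the pdfname clause
matches depends on how that field was analyzed at index time, so treat the
field setup as something to check against your own index.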


Re: search through single pdf document - return page number

2009-10-16 Thread Erick Erickson
Well, you have to add another field to each document identifying the PDF it
came from. From there, restricting to that doc just becomes
adding an AND clause. Of course how you specify these is "an
exercise left to the reader".

Erick
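
The idea above, sketched without Lucene: every page entry carries the name of
the PDF it came from, and a search is the conjunction "this PDF AND this term".
The PageStore class and its method names are hypothetical, and a linear scan
with substring matching stands in for a real index here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy stand-in for an index: one entry per page, tagged with its source PDF.
final class PageStore {
    private static final class Page {
        final String pdfName;
        final int pageNumber;
        final String text;

        Page(String pdfName, int pageNumber, String text) {
            this.pdfName = pdfName;
            this.pageNumber = pageNumber;
            this.text = text;
        }
    }

    private final List<Page> pages = new ArrayList<>();

    /** Stores one page together with the PDF it belongs to. */
    void addPage(String pdfName, int pageNumber, String text) {
        pages.add(new Page(pdfName, pageNumber, text.toLowerCase(Locale.ROOT)));
    }

    /** Page numbers of the given PDF whose text contains the term (pdfName AND term). */
    List<Integer> search(String pdfName, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (Page p : pages) {
            if (p.pdfName.equals(pdfName) && p.text.contains(needle)) {
                hits.add(p.pageNumber);
            }
        }
        return hits;
    }
}
```

One store holds the pages of all PDFs, so adding another PDF to the search
pool does not require a second index.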

Re: search through single pdf document - return page number

2009-10-16 Thread IvanDrago

Proximity queries that span pages are not a concern in my case.

I asked another question at the bottom of my last post. Could you comment on
that if you have some ideas?


Re: search through single pdf document - return page number

2009-10-16 Thread Erick Erickson
Glad things are progressing. The only problem here will be proximity queries
that span pages. Say the last word on page 10 is
"salmon" and the first word on page 11 is "fishing". With the index
structured this way, a proximity search for "salmon fishing" won't find
that match.

If that's not a concern, then there's no reason to complexify the
situation.

FWIW
Erick
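
The page-boundary case can be reproduced in a few lines. This is only a
sketch: plain substring matching stands in for a Lucene phrase or proximity
query, and the PageBoundaryDemo class is made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class PageBoundaryDemo {
    private PageBoundaryDemo() {}

    /** 1-based numbers of the pages whose own text contains the phrase. */
    static List<Integer> pagesContaining(List<String> pages, String phrase) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            if (pages.get(i).contains(phrase)) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    /** True if the phrase occurs once all pages are joined into one text. */
    static boolean wholeDocumentContains(List<String> pages, String phrase) {
        return String.join(" ", pages).contains(phrase);
    }
}
```

With one page ending in "salmon" and the next starting with "fishing", the
per-page search returns no pages for "salmon fishing", while the search over
the joined text finds it.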

Re: search through single pdf document - return page number

2009-10-16 Thread IvanDrago

Hey! I did it! Erick and Robert, you helped a lot. Thanks!

I didn't use LucenePDFDocument. I created a new Lucene document for every page
in the PDF document and added the page number to each one.

PDDocument pddDocument = PDDocument.load(f);
PDFTextStripper textStripper = new PDFTextStripper();

IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);

long start = new Date().getTime();

// 350 pages just for test
for (int i = 1; i <= 350; i++) {
    // Restrict the stripper to page i only
    textStripper.setStartPage(i);
    textStripper.setEndPage(i);

    // Fetch one page
    String pagecontent = textStripper.getText(pddDocument);
    System.out.println("pagecontent: " + pagecontent);

    if (pagecontent != null) {
        System.out.println("i= " + i);
        Document doc = new Document();

        // Add the page number (stored, so it can be read back from a hit)
        doc.add(new Field("pagenumber", Integer.toString(i),
                Field.Store.YES, Field.Index.ANALYZED));
        // Add the page text (indexed for searching, but not stored)
        doc.add(new Field("content", pagecontent,
                Field.Store.NO, Field.Index.ANALYZED));

        iwriter.addDocument(doc);
    }
}

// Optimize and close the writer to finish building the index
iwriter.optimize();
iwriter.close();
pddDocument.close();

long end = new Date().getTime();

System.out.println("Indexing files took " + (end - start) + " milliseconds");

// Just for test, search for the string "cryptography"
String q = "cryptography";

Directory fsDir = FSDirectory.getDirectory(index_dir, false);
IndexSearcher ind_searcher = new IndexSearcher(fsDir);

// Build a Query object
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse(q);

// Search for the query
Hits hits = ind_searcher.search(query);

// Examine the Hits object to see if there were any matches
int hitCount = hits.length();
if (hitCount == 0) {
    System.out.println("No matches were found for \"" + q + "\"");
} else {
    System.out.println("Hits for \"" + q + "\" were found in pages:");

    // Iterate over the Documents in the Hits object
    for (int i = 0; i < hitCount; i++) {
        Document doc = hits.doc(i);

        // Print the value stored in the "pagenumber" field. Unlike the
        // "content" field, it was stored verbatim and can be retrieved.
        System.out.println("  " + (i + 1) + ". " + doc.get("pagenumber"));
    }
}
ind_searcher.close();


I'm using Lucene version 2.9.0.
You said that Hits is deprecated. Should I use HitCollector instead?

Another question came to mind: what if I want to add another PDF
document to the search pool? Before searching, I would like to specify which
PDF document to search and then get back the page number for the searched
string. I could create a separate index for every document that I add to the
search pool, but that doesn't sound good to me. Can you think of a better way
to do that?


Re: search through single pdf document - return page number

2009-10-15 Thread Erick Erickson
Your search would be on the "contents" field if you use LucenePDFDocument.

But on a quick look, LucenePDFDocument doesn't give you any page
information. So, you'd have to collect that somehow, but I don't see a clear
way to.

Doing it manually, you could do something like:

Document doc = new Document();
for (each page in the document) {
  doc.add("contents", <text of this page>);
  record the offset of the last term in the page you just indexed;
}
doc.add("metadata", <the recorded page offsets>);
iw.addDocument(doc);

Now, when you search you can get the offsets of the matching term,
then look in your metadata field for the page number.

Perhaps you could use the LucenePDFDocument in conjunction with this
somehow, but I confess that I've never used it so it's not clear to me how
you'd do this.

Incidentally, the Hits object is deprecated. What version of Lucene are
you intending to use?

Best
Erick
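
A sketch of the lookup step described above, assuming the recorded offsets are
available as a strictly increasing array in which each entry is the offset
just past the last term of a page. The PageOffsets class and that exact
encoding are assumptions for illustration; how the offsets are stored in the
metadata field is left open.

```java
import java.util.Arrays;

// Hypothetical helper: maps the offset of a match back to a page number.
final class PageOffsets {
    // pageEnds[k] is the offset just past the last term of page k + 1.
    // Assumed to be strictly increasing (no empty pages).
    private final int[] pageEnds;

    PageOffsets(int[] pageEnds) {
        this.pageEnds = pageEnds.clone();
    }

    /** Returns the 1-based page holding the offset, or -1 if it is out of range. */
    int pageFor(int offset) {
        if (offset < 0) {
            return -1;
        }
        int pos = Arrays.binarySearch(pageEnds, offset);
        // Exact hit at index k: the offset is the first one of the following page.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the insertion
        // point is the index of the first page end greater than the offset.
        int index = pos >= 0 ? pos + 1 : -pos - 1;
        return index < pageEnds.length ? index + 1 : -1;
    }
}
```

With page ends {100, 250, 400}, offsets 0 to 99 fall on page 1, 100 to 249 on
page 2 and 250 to 399 on page 3. The binary search keeps the lookup cheap even
for a document with thousands of pages.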

On Thu, Oct 15, 2009 at 10:43 AM, IvanDrago  wrote:

>
> Thanks for the reply Erick.
>
> I would like to permanently index this content and search it multiple
> times for different terms, so I would like a permanent copy.
>
> My problem is that I don't know how to retrieve the page number where the
> searched string was found, so if you could help with that issue, that
> would be great.
>
> // I would start like this:
> // This part of the code would create the index, right?
> Document luceneDocument = LucenePDFDocument.getDocument(f);
> IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);
> iwriter.addDocument(luceneDocument);
> iwriter.close();
>
> // And now for the search:
> Directory fsDir = FSDirectory.getDirectory(index_dir, false);
> IndexSearcher ind_search = new IndexSearcher(fsDir);
>
> // I'm not sure whether "fieldname" should be the string that I'm searching for?
> QueryParser parser = new QueryParser("fieldname", new StandardAnalyzer());
> Query query = parser.parse(q);
>
> Hits hits = ind_search.search(query);
>
> // And I'm stuck here. I don't know how to retrieve the page number.
>
> Erick Erickson wrote:
> >
> > It depends (tm). Do you want to permanently index this content and search
> > it multiple times, or is each search a one-off? If the latter, I'd look for
> > packages specific to handling PDF files, although since Reader takes
> > forever to search a document, I suspect there's not much joy there.
> >
> > If you want to parse the file once and search it many times, then yes,
> > Lucene can help a lot. You could conceivably do this in a memory index if
> > you didn't want a permanent copy. In this scheme, you'd index the file
> > before the first search, then use the in-memory index until you were done
> > searching (assuming you wanted to search for different terms multiple
> > times). You'd have to do some record-keeping to remember what the start
> > and end offsets of each page were so you could deal with the case where a
> > phrase you search for starts on one page and ends on another.
> >
> > If this is off base, perhaps you could provide more details...
> >
> > Erick
> >
> > On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago  wrote:
> >
> >>
> >> Hi,
> >>
> >> I have to search a single PDF document for a requested string, and if that
> >> string is found, I need to return the page number where it was found.
> >> The requested string can be anything in the PDF document.
> >>
> >> It is a big document (about 5000 pages), so I'm asking whether that is
> >> possible with Lucene.
> >>
> >> I'm using the PDFBox classes and I found a way to do it (a substring
> >> search page by page), but it is too slow:
> >>
> >> PDDocument pddDocument = PDDocument.load(f);
> >>
> >> PDFTextStripper textStripper = new PDFTextStripper();
> >> int lastpage = textStripper.getEndPage();
> >> String page = null;
> >> int found = 0;
> >>
> >> for (int i = 1; i <= lastpage; i++) {
> >>     textStripper.setStartPage(i);
> >>     textStripper.setEndPage(i);
> >>
> >>     page = textStripper.getText(pddDocument);
> >>
> >>     // indexOf returns -1 when there is no match; 0 is a match at the start
> >>     found = page.indexOf(searchtext);
> >>
> >>     if (found >= 0) { returnpage = i; break; }
> >> }
> >>
> >> Is there a way to speed up the search with Lucene? Can I use indexing to
> >> solve this problem? Thanks.
> >>
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25905217.html
> >> Sent from the Lucene - Java Developer mailing list archive at
> Nabble.com.
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >>
> >>
> >
> >

Re: search through single pdf document - return page number

2009-10-15 Thread Robert Muir
If you just have a single PDF document (it seems from the subject line this
is the case) and you want to retrieve pages, maybe consider splitting the
PDF into single pages.

There is some functionality in PDFBox to do this.

Then index each page as a single Lucene document (so you will have 5000
Lucene documents, one for each page). This way you can run a search and
return page numbers easily.
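
Why one document per page makes page numbers easy to return can be seen in a
toy inverted index. This is a sketch only: the PageIndex class is made up for
the illustration and uses a plain map instead of Lucene, with a very simple
tokenizer (lower-casing and splitting on anything that is not a letter or
digit).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: each term points to the pages it occurs on.
final class PageIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Indexes one page: every lower-cased word points back to the page number. */
    void addPage(int pageNumber, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(pageNumber);
            }
        }
    }

    /** Returns the pages containing the term, in ascending order. */
    SortedSet<Integer> pagesFor(String term) {
        SortedSet<Integer> pages = postings.get(term.toLowerCase(Locale.ROOT));
        return pages == null
                ? Collections.<Integer>emptySortedSet()
                : Collections.unmodifiableSortedSet(pages);
    }
}
```

The text of every page is read once while indexing; afterwards each search is
a single map lookup instead of a scan over all 5000 pages.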


-- 
Robert Muir
rcm...@gmail.com


Re: search through single pdf document - return page number

2009-10-15 Thread IvanDrago

Thanks for the reply Erick.

I would like to permanently index this content and search it multiple
times for different terms, so I would like a permanent copy.

My problem is that I don't know how to retrieve the page number where the
searched string was found, so if you could help with that issue, that
would be great.

// I would start like this:
// This part of the code would create the index, right?
Document luceneDocument = LucenePDFDocument.getDocument(f);
IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);
iwriter.addDocument(luceneDocument);
iwriter.close();

// And now for the search:
Directory fsDir = FSDirectory.getDirectory(index_dir, false);
IndexSearcher ind_search = new IndexSearcher(fsDir);

// I'm not sure whether "fieldname" should be the string that I'm searching for.
// q holds the query string.
QueryParser parser = new QueryParser("fieldname", new StandardAnalyzer());
Query query = parser.parse(q);

Hits hits = ind_search.search(query);

// And I'm stuck here. I don't know how to retrieve the page number.
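
A minimal sketch of one way to get a page number back, assuming each page is indexed as its own entry and the page number is kept alongside it. In Lucene that would mean one Document per page with the page number in a stored field; the class below is a plain-Java stand-in with hypothetical names (PageTermIndex, addPage, pagesFor), not Lucene's API, and it only handles single-term lookups.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical stand-in for a per-page index: every page is its own "document". */
final class PageTermIndex {
    // term -> 1-based page numbers on which the term occurs
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Indexes one page; the page number is the value we want back at search time. */
    void addPage(int pageNumber, String pageText) {
        // Lower-case and split on non-word characters (a crude analyzer, assumed for the sketch).
        for (String term : pageText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(pageNumber);
        }
    }

    /** Returns the pages containing the term in ascending order (empty if none). */
    SortedSet<Integer> pagesFor(String term) {
        SortedSet<Integer> pages = postings.get(term.toLowerCase(Locale.ROOT));
        if (pages == null) {
            return Collections.<Integer>emptySortedSet();
        }
        return Collections.unmodifiableSortedSet(pages);
    }
}
```

The page text would come from PDFTextStripper with setStartPage(i) and setEndPage(i), as in the page-by-page loop quoted below. The PDF is then parsed only once, and each later lookup such as pagesFor("ant") reads the page numbers straight from the index.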


Erick Erickson wrote:
> 
> It depends (tm). Do you want to permanently index this content and search
> it multiple times, or is each search a one-off? If the latter, I'd look for
> packages specific to handling PDF files, although Reader takes forever to
> search a document, so I suspect there's not much joy there.
>
> If you want to parse the file once and search it many times, then yes,
> Lucene can help a lot. You could conceivably do this in a memory index if
> you didn't want a permanent copy. In this scheme, you'd index the file
> before the first search, then use the in-memory index until you were done
> searching (assuming you wanted to search for different terms multiple
> times). You'd have to do some record-keeping to remember what the start
> and end offsets of each page were, so you could deal with the case where a
> phrase you search for starts on one page and ends on another.
> 
> If this is off base, perhaps you could provide more details...
> 
> Erick
> 
> On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago  wrote:
> 
>>
>> Hi,
>>
>> I have to search a single PDF document for a requested string and, if that
>> string is found, return the page number where it was found. The requested
>> string can be anything in the PDF document.
>>
>> It is a big document (about 5000 pages), so I'm asking whether that is
>> possible with Lucene.
>>
>> I'm using the PDFBox classes and I found a way to do it (a substring
>> search page by page), but it is too slow:
>>
>>     PDDocument pddDocument = PDDocument.load(f);
>>
>>     PDFTextStripper textStripper = new PDFTextStripper();
>>     int lastpage = textStripper.getEndPage();
>>     String page = null;
>>     int found = 0;
>>
>>     for (int i = 1; i < lastpage; i++) {
>>         textStripper.setStartPage(i);
>>         textStripper.setEndPage(i);
>>
>>         page = textStripper.getText(pddDocument);
>>
>>         found = page.indexOf(searchtext);
>>
>>         // indexOf returns 0 for a match at the very start of the page
>>         if (found >= 0) { returnpage = i; break; }
>>     }
>> 
>>
>> Is there a way to speed up the search with lucene? Can I use indexing to
>> solve this problem? thanks.
>>
> 
> 




Re: search trough single pdf document - return page number

2009-10-15 Thread Erick Erickson

It depends (tm). Do you want to permanently index this content and search it
multiple times, or is each search a one-off? If the latter, I'd look for
packages specific to handling PDF files, although Reader takes forever to
search a document, so I suspect there's not much joy there.

If you want to parse the file once and search it many times, then yes,
Lucene can help a lot. You could conceivably do this in a memory index if
you didn't want a permanent copy. In this scheme, you'd index the file
before the first search, then use the in-memory index until you were done
searching (assuming you wanted to search for different terms multiple
times). You'd have to do some record-keeping to remember what the start and
end offsets of each page were, so you could deal with the case where a
phrase you search for starts on one page and ends on another.
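
The record-keeping described above could look roughly like this. It is a plain-Java sketch, not Lucene code: the class and method names (PageOffsetIndex, addPage, find) are made up for illustration, and joining pages with a single space is an assumption about how text flows across a page break.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: concatenates page texts and remembers where each page starts. */
final class PageOffsetIndex {
    private final StringBuilder text = new StringBuilder();
    // pageStarts.get(n - 1) is the character offset at which page n begins
    private final List<Integer> pageStarts = new ArrayList<>();

    /** Appends one page; a single space joins pages so a phrase can span the break. */
    void addPage(String pageText) {
        if (text.length() > 0) {
            text.append(' ');
        }
        pageStarts.add(text.length());
        text.append(pageText);
    }

    /** Returns the 1-based page that contains the given character offset. */
    int pageOf(int offset) {
        int page = 1;
        for (int i = 0; i < pageStarts.size(); i++) {
            if (pageStarts.get(i) <= offset) {
                page = i + 1;
            } else {
                break;
            }
        }
        return page;
    }

    /** Returns {firstPage, lastPage} of the first match, or null if there is none. */
    int[] find(String phrase) {
        if (phrase.isEmpty()) {
            return null;
        }
        int start = text.indexOf(phrase);
        if (start < 0) {
            return null;
        }
        int end = start + phrase.length() - 1; // offset of the last matched character
        return new int[] { pageOf(start), pageOf(end) };
    }
}
```

With pages "alpha beta" and "gamma delta" added in that order, find("beta gamma") reports a match that starts on page 1 and ends on page 2. A real index would keep the same offset table next to the indexed text and map hit positions back to pages in the same way.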

If this is off base, perhaps you could provide more details...

Erick

On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago  wrote:

>
> Hi,
>
> I have to search a single PDF document for a requested string and, if that
> string is found, return the page number where it was found. The requested
> string can be anything in the PDF document.
>
> It is a big document (about 5000 pages), so I'm asking whether that is
> possible with Lucene.
>
> I'm using the PDFBox classes and I found a way to do it (a substring
> search page by page), but it is too slow:
>
>     PDDocument pddDocument = PDDocument.load(f);
>
>     PDFTextStripper textStripper = new PDFTextStripper();
>     int lastpage = textStripper.getEndPage();
>     String page = null;
>     int found = 0;
>
>     for (int i = 1; i < lastpage; i++) {
>         textStripper.setStartPage(i);
>         textStripper.setEndPage(i);
>
>         page = textStripper.getText(pddDocument);
>
>         found = page.indexOf(searchtext);
>
>         // indexOf returns 0 for a match at the very start of the page
>         if (found >= 0) { returnpage = i; break; }
>     }
> 
>
> Is there a way to speed up the search with lucene? Can I use indexing to
> solve this problem? thanks.
>