Re: [ANN] PDFBox 0.6.0

2003-03-09 Thread Ben Litchfield

I believe this problem has been fixed with 0.6.1.  Please give it a try.

Ben Litchfield

-- 

On Thu, 6 Mar 2003, Eric Anderson wrote:

 When it throws the exception, the indexer fails, so I cannot continue indexing.

 The problem appears to be limited to certain files: after removing some of
 them, indexing continues past that point, but it fails again as soon as it
 encounters another one of these files.

 Eric Anderson
 LanRx Network Solutions
 815-505-6132


 Quoting Ben Litchfield [EMAIL PROTECTED]:

  In this release I have changed how I parsed the document, which may have
  introduced this bug.  I have received another report of this and will have
  it fixed for the next point release.
 
  You said you tried with a reasonably sized PDF repository.  Did you stop
  indexing at this error or did you continue?  If you continued, is this the
  only error that you got?
 
  -Ben
 
 
 
 
  --
 
  On Thu, 6 Mar 2003, Eric Anderson wrote:
 
   Ben-
   In attempting to use the PDFBox-0.6.0, I rec'd the following error when
   attempting to scan a reasonably sized PDF repository.
  
   Any thoughts?
  
  
caught a class java.io.EOFException
with message: Unexpected end of ZLIB input stream
  
  
   Eric Anderson
   LanRx Network Solutions
  
  
   Quoting Ben Litchfield [EMAIL PROTECTED]:
  
I would like to announce the next release of PDFBox.  PDFBox allows for
PDF documents to be indexed using lucene through a simple interface.
Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument,
which will extract all text and PDF document summary properties as lucene
fields.
   
You can obtain the latest release from http://www.pdfbox.org
   
Please send all bug reports to me and attach the PDF document when
possible.
   
RELEASE 0.6.0
-Massive improvements to memory footprint.
-Must call close() on the COSDocument (LucenePDFDocument does this for you)
-Really fixed the bug where small documents were not being indexed.
-Fixed bug where no whitespace existed between obj and start of object.
Exception in thread main java.io.IOException: expected='obj'
actual='obj/Pro
-Fixed issue with spacing where textLineMatrix was not being copied
 properly
-Fixed 'bug' where parsing would fail with some pdfs with double endobj
 definitions
-Added PDF document summary fields to the lucene document
   
   
Thank you,
Ben Litchfield
http://www.pdfbox.org
   
   
   
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   
  
   LanRx Network Solutions, Inc.
   Providing Enterprise Level Solutions...On A Small Business Budget
  
  
 
 
 







Re: [ANN] PDFBox 0.6.0

2003-03-06 Thread Eric Anderson

Ben-
In attempting to use the PDFBox-0.6.0, I rec'd the following error when 
attempting to scan a reasonably sized PDF repository.

Any thoughts?


 caught a class java.io.EOFException
 with message: Unexpected end of ZLIB input stream
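
For context, the message text above is the one java.util.zip.InflaterInputStream reports when compressed data ends before the stream is complete. The sketch below reproduces that message outside PDFBox by cutting a valid zlib stream in half. It is only an illustration: the class and method names are made up for this example, and nothing here claims to be the actual cause inside PDFBox 0.6.0.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class, not part of PDFBox or Lucene.
class TruncatedZlibDemo {

    /** Compresses the given bytes into a complete zlib stream. */
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decompresses a zlib stream completely. */
    static byte[] inflate(byte[] compressed) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[256];
            int n;
            while ((n = in.read(chunk)) != -1) {
                result.write(chunk, 0, n);
            }
        }
        return result.toByteArray();
    }

    /** Returns "ok: N bytes" on success, or the exception class and message on failure. */
    static String describeInflate(byte[] compressed) {
        try {
            return "ok: " + inflate(compressed).length + " bytes";
        } catch (IOException e) {
            return e.getClass().getName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        byte[] raw = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] compressed = deflate(raw);
        // Drop the second half of the stream to simulate missing data.
        byte[] truncated = Arrays.copyOf(compressed, compressed.length / 2);
        System.out.println("intact:    " + describeInflate(compressed));
        System.out.println("truncated: " + describeInflate(truncated));
    }
}
```

The truncated case fails with java.io.EOFException, which would fit either a damaged PDF stream or a parser that hands the decompressor fewer bytes than the stream really contains.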


Eric Anderson
LanRx Network Solutions


 




Re: [ANN] PDFBox 0.6.0

2003-03-06 Thread Ben Litchfield

In this release I have changed how I parsed the document, which may have
introduced this bug.  I have received another report of this and will have
it fixed for the next point release.

You said you tried with a reasonably sized PDF repository.  Did you stop
indexing at this error or did you continue?  If you continued, is this the
only error that you got?
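
To make "continuing" concrete, here is a minimal sketch of an indexing loop that records a failing file and moves on to the next one rather than aborting the run. The FileIndexer interface and the indexAll method are hypothetical stand-ins for whatever the indexer does per file; they are not PDFBox or Lucene API.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of PDFBox or Lucene.
class SkipFailedFiles {

    /** Stand-in for the per-file work, e.g. building a document and adding it to the index. */
    interface FileIndexer {
        void index(String path) throws IOException;
    }

    /** Indexes every path and returns the paths that failed, without stopping at the first error. */
    static List<String> indexAll(List<String> paths, FileIndexer indexer) {
        List<String> failed = new ArrayList<>();
        for (String path : paths) {
            try {
                indexer.index(path);
            } catch (IOException e) {
                // Record the failure and carry on with the next file.
                System.err.println("skipping " + path + ": " + e);
                failed.add(path);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("a.pdf", "broken.pdf", "c.pdf");
        // Simulated indexer that fails on one file, as in the report.
        FileIndexer simulated = path -> {
            if (path.startsWith("broken")) {
                throw new EOFException("Unexpected end of ZLIB input stream");
            }
        };
        System.out.println("failed: " + indexAll(paths, simulated));
    }
}
```

The returned list of failed paths also answers the second question, since it shows whether the same exception is the only one seen across the repository.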

-Ben




-- 

On Thu, 6 Mar 2003, Eric Anderson wrote:

 Ben-
 In attempting to use the PDFBox-0.6.0, I rec'd the following error when
 attempting to scan a reasonably sized PDF repository.

 Any thoughts?


  caught a class java.io.EOFException
  with message: Unexpected end of ZLIB input stream


 Eric Anderson
 LanRx Network Solutions


 







RE: [ANN] PDFBox 0.6.0

2003-03-06 Thread xx28

Ben,

I downloaded PDFBox and installed it. I can use:
 java org.pdfbox.Main PDF-file output-text-file
to convert a .pdf file to a text file.

Then I tried to integrate it with Lucene. I modified the following code in
IndexHTML.java:

else if (file.getPath().endsWith(".pdf")) {
    // Build a Lucene document from the PDF and add it to the index.
    Document doc = LucenePDFDocument.getDocument(file);
    System.out.println("adding " + "pdf files");
    writer.addDocument(doc);
}

It compiled with ant (ant wardemo). However, when I tested:
java org.apache.lucene.demo.IndexHTML -create -index {index-dir} ..

it did not seem to pick up the new IndexHTML.java, and it still did not index
.pdf files.


Did I miss something here?
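
One possible explanation, not confirmed anywhere in this thread, is that java is still loading an older compiled IndexHTML from another classpath entry. A small check like the sketch below prints where the JVM loaded a class from. ClassOrigin is a hypothetical helper name; it uses only standard java.lang and java.security calls.

```java
import java.security.CodeSource;

// Hypothetical helper class, not part of PDFBox or Lucene.
class ClassOrigin {

    /** Describes the location (directory or jar) a class was loaded from. */
    static String describe(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Classes from the Java runtime itself usually report no code source.
            return type.getName() + " <- (no code source, typically the Java runtime)";
        }
        return type.getName() + " <- " + source.getLocation();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a class name such as org.apache.lucene.demo.IndexHTML; defaults to this class.
        String name = args.length > 0 ? args[0] : ClassOrigin.class.getName();
        System.out.println(describe(Class.forName(name)));
    }
}
```

Running it with the same classpath as the indexer and the argument org.apache.lucene.demo.IndexHTML would show whether the reported location is the freshly built output or an older jar.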

Regards,

George


