Re: extracting non-english text from word, pdf, etc....??

2007-08-02 Thread Ben Litchfield
In terms of PDF documents... PDFBox should work just fine with any Latin-based languages; at this time certain PDFs that have CJK characters can pose some issues. In general English/French/Spanish should be fine. Some PDFs use custom encodings that make it impossible to extract text and

Re: Indexing PDF document

2007-06-06 Thread Ben Litchfield
You need to include both the Bouncy Castle jars and the FontBox jar. Both are included with the PDFBox distribution. Ben Quoting jim shirreffs <[EMAIL PROTECTED]>: Thanks I rebuilt PDFbox and got past that problem but now I am getting Exception in thread "main" java.lang.NoClassDefFoundEr

Re: decrypting a PDF to read the content

2007-02-12 Thread Ben Litchfield
PDFBox comes with a version of BouncyCastle that will work. It is likely that other versions will also work. Is there a specific version that you have tried and didn't work? Ben Quoting Alixandre Santana <[EMAIL PROTECTED]>: Hi All, I got this error when i tried to decrypt a pdf d

Re: Full disk space during indexing process with 120 gb of free disk space

2006-12-04 Thread Ben Litchfield
PDFBox version 0.6 is quite old and there have been many improvements; you should look at moving to the newest version, 0.7.3, although from the description of your problem it probably would not resolve it. If there are a large number of temp files with "pdfbox" in the name then you are most li
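
One way to check for the temp files mentioned above is a small standalone program. This is only a sketch: it assumes the files sit in the JVM's default temp directory (the `java.io.tmpdir` property), which the message does not state, and `PdfBoxTempFiles` is a hypothetical helper name, not part of PDFBox.

```java
import java.io.File;

// Hypothetical helper, not part of PDFBox.
class PdfBoxTempFiles {

    // Returns the entries in dir whose name contains "pdfbox" (case-insensitive).
    static File[] findPdfBoxTempFiles(File dir) {
        File[] matches = dir.listFiles((d, name) -> name.toLowerCase().contains("pdfbox"));
        // listFiles returns null if dir is not a readable directory.
        return matches == null ? new File[0] : matches;
    }

    public static void main(String[] args) {
        // Assumption: temp files are written to the JVM default temp directory.
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        File[] leftovers = findPdfBoxTempFiles(tmp);
        long bytes = 0;
        for (File f : leftovers) {
            bytes += f.length();
        }
        System.out.println(leftovers.length + " matching file(s), " + bytes + " bytes, in " + tmp);
    }
}
```

A large count or byte total here would be consistent with the disk filling up during indexing; an empty result suggests looking elsewhere.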

Re: Out of memory error

2006-07-13 Thread Ben Litchfield
By 300MG I assume you mean 300MB. You can also try extracting the text outside of Lucene by using a PDFBox command line app: java org.pdfbox.ExtractText. You may need to increase the JRE memory like this: java -Xmx512m org.pdfbox.ExtractText OR java -Xmx1024m org.pdfbox.ExtractText If this is
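
To confirm that an -Xmx setting actually reached the JVM, a tiny check like the following may help. It is a sketch only; `HeapReport` is a hypothetical class name and has nothing to do with PDFBox itself.

```java
// Hypothetical helper: prints the maximum heap the running JVM will use.
class HeapReport {

    // Maximum heap size in megabytes, as reported by the JVM.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as `java -Xmx512m HeapReport` and again as `java -Xmx1024m HeapReport` should report figures in the neighbourhood of the requested sizes; the exact number can differ somewhat depending on the JVM.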

Re: Can PDFBox or POI handle multi-byte characters with different encodings?

2006-02-10 Thread Ben Litchfield
PDFBox can handle multi-byte encodings. There are a couple recent fixes for CJK languages that are not part of 0.7.2 but are part of the nightly build. Ben On Fri, 10 Feb 2006, Zhang, Lisheng wrote: > Hi, > > Currently we are using PDFBox to process PDF files and > POI to process DOC/XLS fil

Re: Java heap space ...after index process

2005-10-26 Thread Ben Litchfield
Is this only after the entire indexing process is finished or do you mean it happens on one of the documents you are extracting text from? Are you closing the PDDocument objects when you are done with them? What heap size are you using and have you tried increasing it? What version of PDFBox?

RE: Lucene in Action : example code -> document-parsing framework ...

2005-10-17 Thread Ben Litchfield
In addition, the latest version (0.7.2) of PDFBox does not require log4j, so you could also upgrade to that version. Ben On Mon, 17 Oct 2005 [EMAIL PROTECTED] wrote: > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/log4j/Logger > at org.pdfbox.pdfparser.BaseParser.(Ba
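
A NoClassDefFoundError like the one quoted usually means a jar is missing from the classpath. A quick way to check before running the real program is sketched below. `ClasspathCheck` is a hypothetical helper; the only class name taken from the thread is org.apache.log4j.Logger.

```java
// Hypothetical helper: reports whether a class can be found on the current classpath.
class ClasspathCheck {

    // Returns true if the named class can be loaded (without initializing it).
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name from the stack trace above; pass others as arguments if needed.
        String[] names = args.length > 0 ? args : new String[] { "org.apache.log4j.Logger" };
        for (String name : names) {
            System.out.println(name + ": " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same -classpath you use for indexing; any class reported as MISSING points at a jar that still needs to be added.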

RE: PDFBox PDFExtractor

2005-09-12 Thread Ben Litchfield
lucene fairly easily. I highly suggest you do some tests against your own set of PDF documents. A new version of PDFBox was released this weekend and does have some improvements in terms of speed and memory. Ben Litchfield PDFBox http://www.pdfbox.org/ On Mon, 12 Sep 2005 [EMAIL PROTECTED] wrote

Re: Integrating lucene search with adobe search

2005-08-15 Thread Ben Litchfield
"; It is also possible to pass it in when opening from the command line. Ben Litchfield On Mon, 15 Aug 2005, Andrew Boyd wrote: > Hello all, > After I do my search and display the hits I get back I would like to pass > the search string that I used with lucene to acrobat reader when it

Re: StackOverflowError when index pdf files

2005-07-20 Thread Ben Litchfield
Yes, this sounds like an issue with PDFBox. Can you determine if it is a single PDF document and post an issue on the PDFBox SourceForge site? Thanks, Ben Litchfield On Wed, 20 Jul 2005, Otis Gospodnetic wrote: > It sounds like the problem may stem from your PDF parser >

Re: Lucene - PDFBox

2005-05-25 Thread Ben Litchfield
od friend. > > HELLO > > Legal Soft w are is GOOD. > > > > > I would have expected this... > > GoOD > > March 29, 2005 > > Hello there my good friend. > > HELLO > > Legal Software is GOOD. > > GoOD > > > > - Original Mes

Re: Lucene - PDFBox

2005-05-25 Thread Ben Litchfield
Can you run the following command line application on the PDF to verify that the extracted text is correct: java org.pdfbox.ExtractText Ben On Wed, 25 May 2005, Thomas X Hoban wrote: > > > First, I am new to Lucene. > > Is there anyone out there who has had trouble getting hits when running