In terms of PDF documents...
PDFBox should work just fine with any Latin-based language; at this
time certain PDFs that have CJK characters can pose some issues. In
general English/French/Spanish should be fine.
Some PDFs use custom encodings that make it impossible to extract text
and
you need to include both the Bouncy Castle jars and the FontBox jar.
Both are included with the PDFBox distribution.
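A quick way to see whether the required jars are actually visible to the JVM is to probe for one class from each. The following is only a minimal sketch: the class name ClasspathCheck is made up for illustration, org.pdfbox.ExtractText and org.apache.log4j.Logger are taken from the messages in this thread, and the Bouncy Castle provider class name is an assumption about which class to probe. Pass your own class names as arguments to check anything else.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: reports which classes are on the classpath.
final class ClasspathCheck {

    // True if the class can be located; the class is not initialized by this lookup.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Checks each name in order and records whether it was found.
    static Map<String, Boolean> check(String... classNames) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : classNames) {
            result.put(name, isPresent(name));
        }
        return result;
    }

    public static void main(String[] args) {
        // Default probes are assumptions; override them on the command line.
        String[] names = args.length > 0 ? args : new String[] {
            "org.pdfbox.ExtractText",
            "org.apache.log4j.Logger",
            "org.bouncycastle.jce.provider.BouncyCastleProvider"
        };
        check(names).forEach((name, present) ->
            System.out.println((present ? "found    " : "MISSING  ") + name));
    }
}
```

Run it with the same classpath you use for your application; anything reported as MISSING is a likely source of a NoClassDefFoundError.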
Ben
Quoting jim shirreffs <[EMAIL PROTECTED]>:
Thanks I rebuilt PDFbox and got past that problem but now I am getting
Exception in thread "main" java.lang.NoClassDefFoundEr
PDFBox comes with a version of BouncyCastle that will work. It is
likely that other versions will work as well.
Is there a specific version that you tried and that didn't work?
Ben
Quoting Alixandre Santana <[EMAIL PROTECTED]>:
Hi All,
I got this error when i tried to decrypt a pdf d
PDFBox version 0.6 is quite old and there have been many improvements.
You should look at moving to the newest version, 0.7.3, although from the
description of your problem it probably would not resolve it.
If there are a large number of temp files with "pdfbox" in the name then
you are most li
By 300MG I assume you mean 300MB.
You can also try extracting the text outside of Lucene by using a
PDFBox command-line app:
java org.pdfbox.ExtractText
You may need to increase the JVM heap size like this:
java -Xmx512m org.pdfbox.ExtractText
OR
java -Xmx1024m org.pdfbox.ExtractText
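To confirm that the -Xmx setting is actually reaching the JVM, you can print the maximum heap from inside Java. This is just a small sketch using the standard Runtime API; the class name HeapInfo is made up for illustration.

```java
// Hypothetical helper: prints the maximum heap the running JVM will attempt to use.
final class HeapInfo {

    // Maximum heap in megabytes, as reported by the JVM.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMb() + " MB");
    }
}
```

Running "java -Xmx512m HeapInfo" should report a value close to 512 MB; depending on the JVM, the number may be somewhat lower than the flag.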
If this is
PDFBox can handle multi-byte encodings. There are a couple of recent fixes
for CJK languages that are not part of 0.7.2 but are part of the nightly
build.
Ben
On Fri, 10 Feb 2006, Zhang, Lisheng wrote:
> Hi,
>
> Currently we are using PDFBox to process PDF files and
> POI to process DOC/XLS fil
Is this only after the entire indexing process is finished or do you
mean it happens on one of the documents you are extracting text from?
Are you closing the PDDocument objects when you are done with them?
What heap size are you using and have you tried increasing it?
What version of PDFBox?
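On the question of closing documents, the usual pattern is a try/finally around each document so it is released even when extraction throws. The sketch below uses a made-up StandInDocument class in place of PDDocument so that it runs without PDFBox on the classpath; the method names are hypothetical and only the try/finally shape is the point.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of closing each document as soon as it has been processed.
final class CloseAfterUse {

    // Stand-in for a real document class; NOT the PDFBox API.
    static final class StandInDocument implements Closeable {
        private final String name;
        private boolean open = true;

        StandInDocument(String name) {
            this.name = name;
        }

        String extractText() {
            if (!open) {
                throw new IllegalStateException("document already closed: " + name);
            }
            return "text of " + name;
        }

        boolean isOpen() {
            return open;
        }

        @Override
        public void close() {
            open = false;
        }
    }

    // Processes each document in turn and always closes it before moving on.
    static List<StandInDocument> processAll(List<String> names) {
        List<StandInDocument> processed = new ArrayList<>();
        for (String name : names) {
            StandInDocument doc = new StandInDocument(name);
            try {
                doc.extractText(); // index the text here
            } finally {
                doc.close(); // runs even if extraction throws
            }
            processed.add(doc);
        }
        return processed;
    }

    public static void main(String[] args) {
        for (StandInDocument doc : processAll(List.of("a.pdf", "b.pdf"))) {
            System.out.println("still open? " + doc.isOpen());
        }
    }
}
```

If documents are left open across a long indexing run, memory and temp files can accumulate, which is why closing inside the loop matters.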
In addition, the latest version (0.7.2) of PDFBox does not require log4j,
so you could also upgrade to that version.
Ben
On Mon, 17 Oct 2005 [EMAIL PROTECTED] wrote:
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/log4j/Logger
> at org.pdfbox.pdfparser.BaseParser.(Ba
Lucene fairly
easily.
I highly suggest you do some tests against your own set of PDF documents.
A new version of PDFBox was released this weekend and does have some
improvements in terms of speed and memory.
Ben Litchfield
PDFBox
http://www.pdfbox.org/
On Mon, 12 Sep 2005 [EMAIL PROTECTED] wrote
";
It is also possible to pass it in when opening from the command line.
Ben Litchfield
On Mon, 15 Aug 2005, Andrew Boyd wrote:
> Hello all,
> After I do my search and display the hits I get back I would like to pass
> the search string that I used with lucene to acrobat reader when it
Yes, this sounds like an issue with PDFBox. Can you determine whether it
is a single PDF document and post an issue on the PDFBox SourceForge site?
Thanks,
Ben Litchfield
On Wed, 20 Jul 2005, Otis Gospodnetic wrote:
> It sounds like the problem may stem from your PDF parser
>
od friend.
>
> HELLO
>
> Legal Soft w are is GOOD.
>
>
>
>
> I would have expected this...
>
> GoOD
>
> March 29, 2005
>
> Hello there my good friend.
>
> HELLO
>
> Legal Software is GOOD.
>
> GoOD
>
>
>
> - Original Mes
Can you run the following command-line application on the PDF to verify
that the extracted text is correct?
java org.pdfbox.ExtractText
Ben
On Wed, 25 May 2005, Thomas X Hoban wrote:
>
>
> First, I am new to Lucene.
>
> Is there anyone out there who has had trouble getting hits when running