[
http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12372310 ]
Richard Braman commented on NUTCH-220:
--------------------------------------
I upgraded the Nutch 0.8 trunk to PDFBox HEAD.
The NullPointerException seems to be resolved by upgrading Nutch to PDFBox
0.7.3.
The major issues in upgrading Nutch to PDFBox 0.7.3 are:
1. PDFBox now depends on FontBox, which must be included as a plugin,
lib-fontbox.
2. PDFBox no longer depends on log4j. When I tried to remove the references to
that dependency from the build.xml for parse-pdf, the build failed with
assorted Ant errors. I left the references to log4j in place and it built fine.
Someone who knows the Nutch build better needs to modify the build.xml and
plugin.xml if the references to log4j should be removed.
plugin.xml for FontBox
<plugin
  id="lib-fontbox"
  name="FontBox"
  version="0.1.0-dev"
  provider-name="org.fontbox">
  <runtime>
    <library name="FontBox-0.1.0-dev.jar">
      <export name="*"/>
    </library>
  </runtime>
</plugin>
build.xml for lib-fontbox
<project name="lib-fontbox" default="jar">
  <import file="../build-plugin.xml"/>
  <!--
   ! Override the compile and jar targets,
   ! since there is nothing to compile here.
   ! -->
  <target name="compile" depends="init"/>
  <target name="jar" depends="compile">
    <copy todir="${build.dir}" verbose="true">
      <fileset dir="./lib" includes="**/*.jar"/>
    </copy>
  </target>
</project>
parse-pdf plugin.xml
<plugin
  id="parse-pdf"
  name="Pdf Parse Plug-in"
  version="1.0.0"
  provider-name="nutch.org">
  <runtime>
    <library name="parse-pdf.jar">
      <export name="*"/>
    </library>
    <library name="PDFBox-0.7.3.jar"/>
    <library name="log4j-1.2.9.jar"/>
    <library name="FontBox-0.1.0-dev.jar"/>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
    <import plugin="lib-log4j"/>
    <import plugin="lib-fontbox"/>
  </requires>
  <extension id="org.apache.nutch.parse.pdf"
             name="PdfParse"
             point="org.apache.nutch.parse.Parser">
    <implementation id="org.apache.nutch.parse.pdf.PdfParser"
                    class="org.apache.nutch.parse.pdf.PdfParser"
                    contentType="application/pdf"
                    pathSuffix=""/>
  </extension>
</plugin>
parse-pdf build.xml
<project name="parse-pdf" default="jar-core">
  <import file="../build-plugin.xml"/>
  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-log4j"/>
    <ant target="jar" inheritall="false" dir="../lib-fontbox"/>
  </target>
  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-log4j/*.jar" />
      <include name="**/lib-fontbox/*.jar" />
    </fileset>
  </path>
  <!-- Deploy Unit test dependencies -->
  <target name="deps-test">
    <ant target="deploy" inheritall="false" dir="../lib-log4j"/>
    <ant target="deploy" inheritall="false" dir="../lib-fontbox"/>
    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
    <ant target="deploy" inheritall="false" dir="../protocol-file"/>
  </target>
  <!-- for junit test -->
  <mkdir dir="${build.test}/data"/>
  <copy file="sample/pdftest.pdf" todir="${build.test}/data"/>
</project>
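For whoever takes up the log4j question above, here is an untested sketch of
how the parse-pdf files might look with the log4j references dropped from both
plugin.xml and build.xml together. It assumes that PDFBox 0.7.3 really has no
remaining log4j dependency and that nothing else in parse-pdf needs lib-log4j;
neither assumption has been verified, and removing the references from
build.xml alone is what produced the Ant errors mentioned above.

```xml
<!-- Sketch only (untested): parse-pdf plugin.xml, log4j entries removed -->
<runtime>
  <library name="parse-pdf.jar">
    <export name="*"/>
  </library>
  <library name="PDFBox-0.7.3.jar"/>
  <library name="FontBox-0.1.0-dev.jar"/>
</runtime>
<requires>
  <import plugin="nutch-extensionpoints"/>
  <import plugin="lib-fontbox"/>
</requires>

<!-- Sketch only (untested): parse-pdf build.xml, log4j entries removed -->
<target name="deps-jar">
  <ant target="jar" inheritall="false" dir="../lib-fontbox"/>
</target>
<path id="plugin.deps">
  <fileset dir="${nutch.root}/build">
    <include name="**/lib-fontbox/*.jar" />
  </fileset>
</path>
<target name="deps-test">
  <ant target="deploy" inheritall="false" dir="../lib-fontbox"/>
  <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
  <ant target="deploy" inheritall="false" dir="../protocol-file"/>
</target>
```

Everything else in the two files would stay as posted above.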
> PDF Box can't parse document: java.lang.NullPointerException
> ------------------------------------------------------------
>
> Key: NUTCH-220
> URL: http://issues.apache.org/jira/browse/NUTCH-220
> Project: Nutch
> Type: Bug
> Environment: PDFBox 0.7.2
> Reporter: Richard Braman
>
> This error was fixed in the latest build of PDFBox, which should be tested
> with Nutch.
> >> 060228 160354 fetch okay, but can't parse
> >> http://www.mstc.state.ms.us/info/stats/transfer/tran0704.pdf, reason:
> >> failed(2,0): Can't be handled as pdf document.
> >> java.lang.NullPointerException
> Yes, the NPE should be fixed.
> Ben
> Richard Braman wrote:
> > Hi Ben,
> >
> > We actually got to the bottom of all of them except for one. The
> > content truncation was due to an inconsistency bug in the Nutch config.
> > The "no permission to extract text" error is actually correct: the author,
> > the NC Department of Revenue, put this restriction on all of their files
> > (I have asked them to remove it, as it hampers public accessibility). The
> > NullPointerException is the only one left to deal with, and it may be due
> > to the parsing bug. Is this the one you are referring to?
> >
> > -----Original Message-----
> > From: Ben Litchfield [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, March 02, 2006 4:07 PM
> > To: Richard Braman
> > Cc: [email protected]; [email protected];
> > [EMAIL PROTECTED]
> > Subject: Re: [PDFBox-user] PDF Parse Error
> >
> >
> >
> > I believe these errors are due to a parsing bug in PDFBox that has
> > been fixed since the 0.7.2 release. Please give the nightly
> > build (should be a drop-in replacement) a try from
> > http://www.pdfbox.org/dist and let me know if you are still having
> > issues.
> >
> > Ben