Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

Tommaso Teofili Wed, 28 Jul 2010 00:31:31 -0700

I attached a patch for the Solr 1.4.1 release to
https://issues.apache.org/jira/browse/SOLR-1902 that made things work for
me.
The strange behaviour I saw was due to the fact that I had copied the patched
jars and war inside the dist directory but forgot to update the war inside
the example/webapps directory (the one that Jetty runs).
Hope this helps.
Tommaso
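
A quick way to catch the stale-war mistake described above is to compare checksums of the freshly built war and the one the servlet container actually serves. The following is only a minimal sketch: the two file names are assumptions about a typical Solr 1.4.1 checkout and are not taken from this thread, so adjust them to your layout.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deployed_war_is_stale(built_war: Path, deployed_war: Path) -> bool:
    """True when the deployed war is missing or differs from the built one."""
    if not deployed_war.is_file():
        return True
    return sha256_of(built_war) != sha256_of(deployed_war)


if __name__ == "__main__":
    # Hypothetical paths, relative to the root of the Solr source tree.
    built = Path("dist/apache-solr-1.4.1.war")
    deployed = Path("example/webapps/solr.war")
    if built.is_file():
        print("stale" if deployed_war_is_stale(built, deployed) else "up to date")
    else:
        print(f"no built war found at {built}")
```

If the script reports "stale", copy the rebuilt war over the deployed one and restart the container before testing extraction again.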


2010/7/27 David Thibault <dthiba...@esperion.com>

> Alessandro & all,
>
> I was having the same issue with Tika crashing on certain PDFs.  I also
> noticed the bug where no content was extracted after upgrading Tika.
>
> When I went to the SOLR issue you link to below, I applied all the patches,
> downloaded the Tika 0.8 jars, restarted tomcat, posted a file via curl, and
> got the following error:
> SEVERE: java.lang.NoSuchMethodError: org.apache.solr.core.SolrResourceLoader.getClassLoader()Ljava/lang/ClassLoader;
>     at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:93)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:244)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
>     at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:859)
>     at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579)
>     at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1555)
>     at java.lang.Thread.run(Thread.java:619)
>
> This is really weird because I DID apply the SolrResourceLoader patch that
> adds the getClassLoader method.  I even verified it by opening up the
> JARs and looking at the class file in Eclipse...I can see the
> SolrResourceLoader.getClassLoader() method.
>
> Does anyone know why it can't find the method?  After patching the source I
> ran ant clean dist in the base directory of the Solr source tree and
> everything looked like it compiled (BUILD SUCCESSFUL).  Then I copied all
> the jars from dist/ and all the library dependencies from
> contrib/extraction/lib/ into my SOLR_HOME.  After restarting tomcat,
> everything in the logs looked good.
>
> I'm stumped.  It would be very nice to have a Solr implementation using the
> newest versions of PDFBox & Tika and actually have content being
> extracted...=)
>
> Best,
> Dave
>
>
> -----Original Message-----
> From: Alessandro Benedetti [mailto:benedetti.ale...@gmail.com]
> Sent: Tuesday, July 27, 2010 6:09 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr
> CELL/Tika/PDFBox
>
> Hi Jon,
> Over the last few days we faced the same problem.
> Using stock Solr 1.4.1 (Tika 0.4), we can't extract content from some PDF
> files, and for others Solr throws an exception during the indexing
> process.
> You must:
> Update the Tika libraries (in /contrib/extraction/lib) to the tika-core 0.8
> snapshot and tika-parsers 0.8.
> Update PDFBox and all related libraries.
> After that you have to patch Solr 1.4.1 following this patch:
>
> https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel
> This is the first way to solve the problem.
>
> Using Solr 1.4.1 (with the Tika 0.8 snapshot and updated PDFBox), no
> exception is thrown during the indexing process, but no content is extracted.
> Using the latest Solr trunk (with the Tika 0.8 snapshot and updated PDFBox),
> everything seems to work, but we don't know how stable it is!
> I hope you now have a clear view of this issue.
> Best Regards
>
>
>
> 2010/7/26 Sharp, Jonathan <jsh...@coh.org>
>
> >
> > Every so often I need to index new batches of scanned PDFs and
> > occasionally Adobe's OCR can't recognize the text in a couple of these
> > documents. In these situations I would like to type in a small amount of
> > text onto the document and have it be extracted by Solr CELL.
> >
> > Adobe Pro 9 has a number of different ways to add text directly to a PDF
> > file:
> >
> > *Typewriter
> > *Sticky Note
> > *Callout boxes
> > *Text boxes
> >
> > I tried indexing documents with each of these text additions with Solr
> > 1.4.1 + Solr CELL but can't extract the text in any of these boxes.
> >
> > If someone has modified their Solr CELL installation to use more recent
> > versions of Tika (above 0.4) or PDFBox (above 0.7.3) and/or can comment
> > on whether newer versions can pull the text out of any of these various
> > text boxes I'd appreciate that very much.
> >
> > -Jon
> >
> >
> >
> >
> >
> >
>
>
> --
> --------------------------
>
> Benedetti Alessandro
> Personal Page: http://tigerbolt.altervista.org
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>
>
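
A NoSuchMethodError like the one in the quoted trace is raised when the class the JVM actually loaded lacks a method that the caller was compiled against, so it is worth checking whether an older, unpatched copy of SolrResourceLoader is still deployed somewhere. The sketch below is one possible way to look for every copy under a directory, including jars nested inside a war. The directory names in the example call are hypothetical and not taken from this thread.

```python
import io
import zipfile
from pathlib import Path

# Archive entry for the class named in the NoSuchMethodError above.
CLASS_ENTRY = "org/apache/solr/core/SolrResourceLoader.class"


def find_class_copies(root: Path, class_entry: str = CLASS_ENTRY) -> list:
    """List every .jar/.war under root (and jars nested in a war) holding class_entry."""
    hits = []
    for archive in sorted(root.rglob("*")):
        if archive.suffix not in {".jar", ".war"} or not archive.is_file():
            continue
        try:
            with zipfile.ZipFile(archive) as outer:
                names = outer.namelist()
                # Class stored directly in the archive, or unpacked in a war.
                if class_entry in names or f"WEB-INF/classes/{class_entry}" in names:
                    hits.append(str(archive))
                # Jars bundled inside a war live under WEB-INF/lib/.
                for name in names:
                    if name.startswith("WEB-INF/lib/") and name.endswith(".jar"):
                        with zipfile.ZipFile(io.BytesIO(outer.read(name))) as inner:
                            if class_entry in inner.namelist():
                                hits.append(f"{archive}!{name}")
        except zipfile.BadZipFile:
            continue  # skip files that are not valid archives
    return hits


if __name__ == "__main__":
    # Hypothetical locations: a Tomcat webapps directory and a SOLR_HOME.
    for location in (Path("/path/to/tomcat/webapps"), Path("/path/to/solr_home")):
        if location.is_dir():
            for hit in find_class_copies(location):
                print(hit)
```

If more than one copy turns up, the one inside the deployed war is a likely candidate for the stale class, which matches the cause reported at the top of this message.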
