Here's an example of using Tika in a stand-alone Java program. https://lucidworks.com/blog/indexing-with-solrj/
Best,
Erick

On Fri, Jan 16, 2015 at 7:42 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
> It would be nice to have a SolrJ-level implementation as well as a
> command-line implementation of the extraction request handler so that app
> ingestion code could do the extraction outside of Solr at the app level and
> even as a separate process to stream to the app or Solr. That would permit
> the app to do customization, entity extraction, boiler-plate removal, etc. in
> app-friendly code, before transport to the Solr server.
>
> The extraction request handler is a really cool feature and quite
> sufficient for a lot of scenarios, but additional architectural flexibility
> would be a big win.
>
> -- Jack Krupansky
>
> On Fri, Jan 16, 2015 at 10:21 AM, Charlie Hull <char...@flax.co.uk> wrote:
>
>> On 16/01/2015 04:02, Dan Davis wrote:
>>
>>> Why re-write all the document conversion in Java ;) Tika is very slow.
>>> 5 GB PDF is very big.
>>
>> Or you can run Tika in a separate process, or even on a separate machine,
>> wrapped with something to cope if it dies due to some horrible input...we
>> generally avoid document format translation within Solr and do it
>> externally before feeding documents to Solr.
>>
>> Charlie
>>
>>> If you have a lot of PDFs like that, try pdftotext in HTML and UTF-8 output
>>> mode. The HTML mode captures some meta-data that would otherwise be
>>> lost.
>>>
>>> If you need to go faster still, you can also write some stuff linked
>>> directly against the poppler library.
>>>
>>> Before you jump down my throat about Tika being slow - I wrote a PDF
>>> indexer that ran at 36 MB/s per core. Different indexer, all C, lots of
>>> setjmp/longjmp. But fast...
>>>
>>> On Thu, Jan 15, 2015 at 1:54 PM, <ganesh.ya...@sungard.com> wrote:
>>>
>>>> Siegfried and Michael, thank you for your replies and help.
>>>>
>>>> -----Original Message-----
>>>> From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
>>>> Sent: Thursday, January 15, 2015 3:45 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: OutOfMemoryError for PDF document upload into Solr
>>>>
>>>> Hi Ganesh,
>>>>
>>>> you can increase the heap size but parsing a 4 GB PDF document will very
>>>> likely consume A LOT OF memory - I think you need to check if that large
>>>> PDF can be parsed at all :-)
>>>>
>>>> Cheers,
>>>>
>>>> Siegfried Goeschl
>>>>
>>>> On 14.01.15 18:04, Michael Della Bitta wrote:
>>>>
>>>>> Yep, you'll have to increase the heap size for your Tomcat container.
>>>>>
>>>>> http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial-heap-size-correctly
>>>>>
>>>>> Michael Della Bitta
>>>>> Senior Software Engineer
>>>>> o: +1 646 532 3062
>>>>>
>>>>> appinions inc.
>>>>> “The Science of Influence Marketing”
>>>>>
>>>>> 18 East 41st Street
>>>>> New York, NY 10017
>>>>>
>>>>> t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions
>>>>> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
>>>>> w: appinions.com <http://www.appinions.com/>
>>>>>
>>>>> On Wed, Jan 14, 2015 at 12:00 PM, <ganesh.ya...@sungard.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Can someone pass on hints to get around the following error? Is there
>>>>>> any heap size parameter I can set in Tomcat or in the Solr webapp that
>>>>>> gets deployed in Solr?
>>>>>>
>>>>>> I am running the Solr webapp inside Tomcat on my local machine, which has
>>>>>> 12 GB of RAM. I have a PDF document, 4 GB max in size, that
>>>>>> needs to be loaded into Solr.
>>>>>>
>>>>>> Exception in thread "http-apr-8983-exec-6" java.lang.OutOfMemoryError: Java heap space
>>>>>>     at java.util.AbstractCollection.toArray(Unknown Source)
>>>>>>     at java.util.ArrayList.<init>(Unknown Source)
>>>>>>     at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
>>>>>>     at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
>>>>>>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
>>>>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
>>>>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
>>>>>>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
>>>>>>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>>>>>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>>>>>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>>>>>>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>>>>>>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
>>>>>>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>>>>>>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
>>>>>>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
>>>>>>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
>>>>>>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
>>>>>>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
>>>>>>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
>>>>>>     at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
>>>>>>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
>>>>>>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)
>>>>>>     at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
>>>>>>     at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:611)
>>>>>>     at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2462)
>>>>>>     at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:2451)
>>>>>>
>>>>>> Thanks
>>>>>> Ganesh
>>
>> --
>> Charlie Hull
>> Flax - Open Source Enterprise Search
>>
>> tel/fax: +44 (0)8700 118334
>> mobile: +44 (0)7767 825828
>> web: www.flax.co.uk
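
For anyone landing on this thread later, here is a rough sketch of the approach Charlie describes: run the extraction in a separate process, with a wrapper that copes if the extractor dies or hangs on bad input, so a failure cannot take down the JVM that hosts Solr. It is only an illustration, not code from anyone in this thread. The class name `IsolatedExtractor` and the method `extract` are made up for the example, Java 11 or later is assumed, and the `pdftotext` flags shown (`-enc UTF-8`, `-htmlmeta`, and `-` for standard output) are my reading of the poppler tool's options, so please check them against your installed version.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs an external text extractor in its own process. */
public class IsolatedExtractor {

    /**
     * Runs the given command and returns what it wrote to standard output.
     * Returns Optional.empty() if the command cannot be started, exits with a
     * non-zero status, or does not finish within the timeout.
     */
    public static Optional<String> extract(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Capture stdout in a temp file so a large output cannot fill a pipe
        // buffer and block the child while we wait for it.
        Path out = Files.createTempFile("extract-", ".txt");
        try {
            Process process;
            try {
                process = new ProcessBuilder(command)
                        .redirectOutput(out.toFile())
                        .redirectError(ProcessBuilder.Redirect.DISCARD)
                        .start();
            } catch (IOException e) {
                // The extractor binary is missing or not executable.
                return Optional.empty();
            }
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                // The extractor hung on this input: kill it and move on.
                process.destroyForcibly();
                process.waitFor();
                return Optional.empty();
            }
            if (process.exitValue() != 0) {
                // The extractor crashed or rejected the document.
                return Optional.empty();
            }
            // Decode leniently: malformed bytes become replacement characters.
            return Optional.of(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(out);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: java IsolatedExtractor <file.pdf>");
            return;
        }
        // Assumed pdftotext invocation; verify the flags for your version.
        List<String> command =
                List.of("pdftotext", "-enc", "UTF-8", "-htmlmeta", args[0], "-");
        Optional<String> text = extract(command, 300);
        System.out.println(text.isPresent()
                ? "extracted " + text.get().length() + " characters"
                : "extraction failed or timed out; skipping " + args[0]);
    }
}
```

The extracted text could then be put into a SolrInputDocument and sent with SolrJ, along the lines of the blog post Erick links at the top. The same wrapper should work for a Tika command line in place of pdftotext, since it only cares about the exit status and the timeout.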