Here's an example of using Tika in a stand-alone Java program.

https://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Fri, Jan 16, 2015 at 7:42 AM, Jack Krupansky
<jack.krupan...@gmail.com> wrote:
> It would be nice to have a SolrJ-level implementation as well as a
> command-line implementation of the extraction request handler so that app
> ingestion code could do the extraction outside of Solr at the app level and
> even as a separate process to stream to the app or Solr. That would permit
> the app to do customization, entity extraction, boiler-plate removal, etc. in
> app-friendly code, before transport to the Solr server.
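
To make the app-level flow above concrete, here is a minimal sketch in Python, assuming the text has already been extracted outside Solr. It cleans the text and builds a request for Solr's JSON update handler. The Solr URL, the collection name `docs`, the field name `content_txt`, and the `strip_boilerplate` helper are all assumptions for illustration; this is not an existing SolrJ or Solr tool.

```python
import json
import urllib.request


def strip_boilerplate(text):
    """Hypothetical cleanup step: collapse whitespace and drop blank lines."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def build_update_request(solr_url, collection, doc_id, text):
    """Build (but do not send) a JSON update request for one document."""
    # Field names are assumed; they must match the target schema.
    doc = {"id": doc_id, "content_txt": strip_boilerplate(text)}
    body = json.dumps([doc]).encode("utf-8")
    return urllib.request.Request(
        url=f"{solr_url}/{collection}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request needs a running Solr instance, e.g. `urllib.request.urlopen(build_update_request("http://localhost:8983/solr", "docs", "doc-1", text))`. Keeping the build step separate from the send step makes the cleanup logic easy to test without a server.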
>
> The extraction request handler is a really cool feature and quite
> sufficient for a lot of scenarios, but additional architectural flexibility
> would be a big win.
>
> -- Jack Krupansky
>
> On Fri, Jan 16, 2015 at 10:21 AM, Charlie Hull <char...@flax.co.uk> wrote:
>
>> On 16/01/2015 04:02, Dan Davis wrote:
>>
>>> Why re-write all the document conversion in Java ;)  Tika is very slow.
>>> A 5 GB PDF is very big.
>>>
>>
>> Or you can run Tika in a separate process, or even on a separate machine,
>> wrapped with something to cope if it dies due to some horrible input...we
>> generally avoid document format translation within Solr and do it
>> externally before feeding documents to Solr.
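
A rough sketch of that kind of wrapper, in Python: the extractor runs as a child process with a timeout, and a crash, hang, or out-of-memory exit in the child is reported as `None` instead of taking the indexing process down. The jar location, the heap size, and the function names are assumptions for illustration; `--text` is the Tika app's plain-text output option.

```python
import subprocess
import sys


def extract_text(cmd, timeout_s=300):
    """Run an external extractor; return its stdout, or None if it fails."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
            check=True,  # a non-zero exit code raises CalledProcessError
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        # The child hung, crashed (e.g. ran out of heap), or could not start.
        # The calling process keeps running and can skip or log the document.
        return None
    # Assumes the extractor writes UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")


def tika_command(jar_path, pdf_path, heap="2g"):
    """Command line for the Tika app jar; jar path and heap size are assumed."""
    return ["java", f"-Xmx{heap}", "-jar", jar_path, "--text", pdf_path]
```

A call would look like `extract_text(tika_command("tika-app.jar", "big.pdf"))`. Because the heap limit applies only to the child JVM, a huge PDF can exhaust it without affecting Solr or the application.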
>>
>> Charlie
>>
>>
>>> If you have a lot of PDFs like that, try pdftotext in HTML and UTF-8 output
>>> mode.   The HTML mode captures some meta-data that would otherwise be
>>> lost.
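
For reference, a minimal sketch of calling pdftotext that way from Python. It assumes the poppler build of pdftotext is on the PATH, and takes `-htmlmeta` (simple HTML output including meta information) and `-enc UTF-8` to be the options meant by "HTML and UTF-8 output mode". The function names and the timeout are illustrative.

```python
import subprocess


def pdftotext_command(pdf_path, out_path):
    """Build a pdftotext call that writes simple HTML with metadata, in UTF-8."""
    return ["pdftotext", "-htmlmeta", "-enc", "UTF-8", pdf_path, out_path]


def convert(pdf_path, out_path, timeout_s=600):
    """Run pdftotext; return True on success, False on failure or timeout."""
    try:
        subprocess.run(
            pdftotext_command(pdf_path, out_path),
            check=True,
            timeout=timeout_s,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        # Bad input, a hung conversion, or pdftotext not being installed.
        return False
    return True
```

The resulting HTML file can then be parsed by the application and sent to Solr as ordinary fields, so no PDF parsing happens inside the Solr JVM.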
>>>
>>>
>>> If you need to go faster still, you can also write some code linked
>>> directly against the poppler library.
>>>
>>> Before you jump down my throat about Tika being slow - I wrote a PDF
>>> indexer that ran at 36 MB/s per core.   Different indexer, all C, lots of
>>> setjmp/longjmp.   But fast...
>>>
>>>
>>>
>>> On Thu, Jan 15, 2015 at 1:54 PM, <ganesh.ya...@sungard.com> wrote:
>>>
>>>  Siegfried and Michael, thank you for your replies and help.
>>>>
>>>> -----Original Message-----
>>>> From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
>>>> Sent: Thursday, January 15, 2015 3:45 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: OutOfMemoryError for PDF document upload into Solr
>>>>
>>>> Hi Ganesh,
>>>>
>>>> you can increase the heap size but parsing a 4 GB PDF document will very
>>>> likely consume A LOT OF memory - I think you need to check if that large
>>>> PDF can be parsed at all :-)
>>>>
>>>> Cheers,
>>>>
>>>> Siegfried Goeschl
>>>>
>>>> On 14.01.15 18:04, Michael Della Bitta wrote:
>>>>
>>>>> Yep, you'll have to increase the heap size for your Tomcat container.
>>>>>
>>>>> http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial-heap-size-correctly
>>>>>
>>>>> Michael Della Bitta
>>>>>
>>>>> Senior Software Engineer
>>>>>
>>>>> o: +1 646 532 3062
>>>>>
>>>>> appinions inc.
>>>>>
>>>>> “The Science of Influence Marketing”
>>>>>
>>>>> 18 East 41st Street
>>>>>
>>>>> New York, NY 10017
>>>>>
>>>>> t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions
>>>>> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
>>>>> w: appinions.com <http://www.appinions.com/>
>>>>>
>>>>> On Wed, Jan 14, 2015 at 12:00 PM, <ganesh.ya...@sungard.com> wrote:
>>>>>
>>>>>  Hello,
>>>>>>
>>>>>> Can someone pass on hints to get around the following error? Is there
>>>>>> any heap size parameter I can set in Tomcat or in the Solr webapp that
>>>>>> gets deployed in Solr?
>>>>>>
>>>>>> I am running the Solr webapp inside Tomcat on my local machine, which has
>>>>>> 12 GB of RAM. I have a PDF document, 4 GB max in size, that
>>>>>> needs to be loaded into Solr.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Exception in thread "http-apr-8983-exec-6" java.lang.OutOfMemoryError: Java heap space
>>>>>>           at java.util.AbstractCollection.toArray(Unknown Source)
>>>>>>           at java.util.ArrayList.<init>(Unknown Source)
>>>>>>           at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
>>>>>>           at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
>>>>>>           at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
>>>>>>           at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
>>>>>>           at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
>>>>>>           at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
>>>>>>           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>>>>>           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>>>>>           at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>>>>>>           at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>>>>>>           at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>>           at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>           at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
>>>>>>           at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>>>>>>           at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
>>>>>>           at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
>>>>>>           at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>>>>>>           at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
>>>>>>           at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
>>>>>>           at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
>>>>>>           at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
>>>>>>           at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
>>>>>>           at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
>>>>>>           at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
>>>>>>           at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
>>>>>>           at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)
>>>>>>           at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
>>>>>>           at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:611)
>>>>>>           at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2462)
>>>>>>           at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:2451)
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> Ganesh
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>> --
>> Charlie Hull
>> Flax - Open Source Enterprise Search
>>
>> tel/fax: +44 (0)8700 118334
>> mobile:  +44 (0)7767 825828
>> web: www.flax.co.uk
>>
