i run

        tika 2.8.0

it's used for attachment scanning by a dovecot imap server

it runs on an external (to dovecot) server, on the same lan

it's up & running

        ps ax | grep tika
                63506 ?        Ssl    0:00 /usr/bin/java -Dpdfbox.fontcache=/var/tika -XX:ParallelGCThreads=1 -XX:CICompilerCount=2 -XX:-CICompilerCountPerCPU -jar /srv/apps/tika/tika-server.jar -c /usr/local/etc/tika/tika-server-config-custom.xml --host 10.1.7.100 --port 9998
                63540 ?        Sl     0:02 /usr/bin/java -Xms1g -Xmx1g -Dpdfbox.fontcache=/var/tika -Dlog4j2.warn -Djava.awt.headless=true -cp /srv/apps/tika/tika-server.jar -Dtika.server.id= org.apache.tika.server.core.TikaServerProcess -h 10.1.7.100 -p 9998 -i  -c /usr/local/etc/tika/tika-server-config-custom.xml -forkedStatusFile /tmp/apache-tika-server-forked-tmp-15836749653669077604 -numRestarts 0

the dovecot config for using the tika instance is

        fts_tika = http://10.1.7.100:9998/tika/

testing a local PDF on the tika server

        F="/tmp/TEST.pdf"
        /bin/cp -af $F /tmp/test.pdf
        chown vmail:vmail /tmp/test.pdf
        curl \
        -T /tmp/test.pdf \
        http://10.1.7.100:9998/meta

                <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core Test.SNAPSHOT">
                  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
                    <rdf:Description rdf:about=""
                        xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
                        xmlns:xmp="http://ns.adobe.com/xap/1.0/"
                        xmlns:dc="http://purl.org/dc/elements/1.1/"
                        xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
                        xmlns:xmpTPg="http://ns.adobe.com/xap/1.0/t/pg/"
                      pdf:PDFVersion="1.4"
                      pdf:hasXFA="false"
                      pdf:num3DAnnotations="0"
                      pdf:overallPercentageUnmappedUnicodeChars="0.0"
                      pdf:hasCollection="false"
                      pdf:encrypted="false"
                      pdf:containsNonEmbeddedFont="false"
                      pdf:hasMarkedContent="true"
                      pdf:producer="Adobe PDF Library 15.0"
                      pdf:totalUnmappedUnicodeChars="0"
                      pdf:hasXMP="true"
                      pdf:containsDamagedFont="false"
                      xmp:CreatorTool="Adobe InDesign 15.1 (Macintosh)"
                      dc:format="application/pdf; version=1.4"
                      dc:language="en-US"
                      xmpMM:DocumentID="xmp.id:8a612346-9d03-4caf-8ebf-da6f3716ed0a"
                      xmpTPg:NPages="14">
                      <pdf:unmappedUnicodeCharsPerPage>
                        <rdf:Seq>
                          <rdf:li>0</rdf:li>
                          <rdf:li>0</rdf:li>
                          <rdf:li>0</rdf:li>
                          <rdf:li>0</rdf:li>
                          <rdf:li>0</rdf:li>
                          <rdf:li>0</rdf:li>
                          <rdf:li>0</rdf:li>
                          <rdf:li>0</rdf:li>
                          <rdf:li>0</rdf:li>
                          <rdf:li>0</rdf:li>
                          <rdf:li>0</rdf:li>
                          <rdf:li>0</rdf:li>
                          <rdf:li>0</rdf:li>
                          <rdf:li>0</rdf:li>
                        </rdf:Seq>
                      </pdf:unmappedUnicodeCharsPerPage>
                      <pdf:charsPerPage>
                        <rdf:Seq>
                          <rdf:li>84</rdf:li>
                          <rdf:li>676</rdf:li>
                          <rdf:li>1653</rdf:li>
                          <rdf:li>1914</rdf:li>
                          <rdf:li>814</rdf:li>
                          <rdf:li>1022</rdf:li>
                          <rdf:li>645</rdf:li>
                          <rdf:li>1221</rdf:li>
                          <rdf:li>1087</rdf:li>
                          <rdf:li>732</rdf:li>
                          <rdf:li>887</rdf:li>
                          <rdf:li>1295</rdf:li>
                          <rdf:li>1263</rdf:li>
                          <rdf:li>149</rdf:li>
                        </rdf:Seq>
                      </pdf:charsPerPage>
                      <pdf:annotationTypes>
                        <rdf:Bag>
                          <rdf:li>null</rdf:li>
                        </rdf:Bag>
                      </pdf:annotationTypes>
                      <pdf:annotationSubtypes>
                        <rdf:Bag>
                          <rdf:li>Link</rdf:li>
                        </rdf:Bag>
                      </pdf:annotationSubtypes>
                    </rdf:Description>
                  </rdf:RDF>
                </x:xmpmeta>


passing/processing an email with a *.pdf attachment from dovecot logs ok,

        Jul 11 08:12:50 svr003 tika[63540]: INFO  [qtp1164394344-41] 09:12:50,042 org.apache.tika.server.core.TikaLoggingFilter Request URI: http://10.1.7.100:9998/tika/
        Jul 11 08:12:50 svr003 tika[63540]: INFO  [qtp1164394344-41] 09:12:50,043 org.apache.tika.server.core.resource.TikaResource /tika (application/pdf)

and results are passed back to dovecot, and scan/index db is updated accordingly

but passing/processing an email with an embedded (forwarded as attachment) *.eml logs the following 'SEVERE' error,

        Jul 11 08:36:49 svr003 tika[62540]: INFO  [qtp1164241227-41] 08:36:49,417 org.apache.tika.server.core.TikaLoggingFilter Request URI: http://10.1.7.100:9998/tika/
        Jul 11 08:36:49 svr003 tika[62540]: INFO  [qtp1164241227-41] 08:36:49,418 org.apache.tika.server.core.resource.TikaResource /tika (message/rfc822)
        Jul 11 08:36:49 svr003 tika[62540]: WARN  [qtp1164241227-41] 08:36:49,419 org.apache.tika.server.core.resource.TikaResource tika/: Text extraction failed ([0-9961000034519].eml)
        Jul 11 08:36:49 svr003 tika[62540]: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:185) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:57) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:357) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.tika.server.core.resource.TikaResource.lambda$produceText$1(TikaResource.java:507) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:177) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1651) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:249) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:122) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:84) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:90) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) [tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) [tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) [tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) [tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) [tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) [tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) [tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) [tika-server-standard-2.8.0.jar:2.8.0]
        Jul 11 08:36:49 svr003 tika[62540]:         at java.lang.Thread.run(Thread.java:833) [?:?]
        Jul 11 08:36:49 svr003 tika[62540]: Jul 11, 2023 8:36:49 AM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
        Jul 11 08:36:49 svr003 tika[62540]: SEVERE: Problem with writing the data, class org.apache.tika.server.core.resource.TikaResource$$Lambda$371/0x00000008012ab9e0, ContentType: text/plain


iiuc, .eml should be parseable

        https://tika.apache.org/2.8.0/formats.html#Mail_formats
        https://tika.apache.org/2.8.0/api/org/apache/tika/parser/mail/RFC822Parser.html

is additional/different config needed for .eml processing?
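
fwiw, the ZeroByteFileException above suggests tika received an empty request body for the .eml part, not that it failed to parse one. one way to separate the two cases might be to PUT a small, non-empty .eml straight to /tika with curl, bypassing dovecot. a minimal sketch only: the test message, the temp path, and the TIKA_URL env var are all made up for illustration, and nothing is sent unless TIKA_URL is set

```shell
#!/bin/sh
# Hypothetical test file; path and message content are illustrative only.
EML="${TMPDIR:-/tmp}/tika-eml-test.eml"

# Write a minimal RFC 822 message: headers, a blank line, then the body.
printf '%s\r\n' \
  'From: sender@example.com' \
  'To: rcpt@example.com' \
  'Subject: tika eml test' \
  'MIME-Version: 1.0' \
  'Content-Type: text/plain; charset=us-ascii' \
  '' \
  'hello from a test message' > "$EML"

# An empty stream is what raises ZeroByteFileException, so confirm
# the payload is non-empty before sending it.
EML_BYTES=$(wc -c < "$EML" | tr -d ' ')
echo "payload size: $EML_BYTES bytes"

# TIKA_URL is an assumed variable, e.g. TIKA_URL=http://10.1.7.100:9998
# curl -T uploads the file with PUT, as in the PDF test against /meta.
if [ -n "${TIKA_URL:-}" ]; then
  curl -sS -T "$EML" \
    -H 'Content-Type: message/rfc822' \
    -H 'Accept: text/plain' \
    "$TIKA_URL/tika"
fi
```

if that returns the message text, the RFC822 parser itself is working and the thing to look at would be what dovecot sends for the embedded message part.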
