Hi Steve,

> file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67

What does the file contain? An .eml file (following RFC 822)?
Would it be possible to share this file, or at least a chunk large
enough to reproduce the issue?

The error message might indicate that there are too many headers:
1000 is the limit for the maximum header count, see [1].
But in that case it's hardly an email message; more likely it is
some other file format erroneously detected as email.
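
To check this locally before sharing the file, a rough header count may
help. The following is only a sketch using the Java standard library, not
mime4j's actual parsing logic: the class name HeaderCount and the method
countHeaderFields are made up for illustration, and it only looks at the
top-level header block (everything before the first empty line).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch, Tika or mime4j.
class HeaderCount {

    /**
     * Counts the lines that start a header field, reading up to the first
     * empty line. Folded continuation lines (starting with space or tab)
     * belong to the previous field and are not counted.
     */
    static int countHeaderFields(BufferedReader reader) throws IOException {
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                break; // an empty line ends the header block
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                continue; // continuation of the previous header field
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java HeaderCount <file>");
            return;
        }
        // ISO-8859-1 maps every byte to a character, so binary content
        // does not cause a decoding error.
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get(args[0]), StandardCharsets.ISO_8859_1)) {
            System.out.println(countHeaderFields(reader)
                + " header-like lines before the first empty line");
        }
    }
}
```

If the count is far above 1000, or the file never contains an empty line,
that would support the guess that it is not really an email message.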

If in doubt, and if parsing this file is mandatory, you could also post
the error to the Tika user mailing list, see [2].

Best,
Sebastian

[1] https://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/stream/MimeConfig.html#setMaxHeaderCount(int)
[2] https://tika.apache.org/mail-lists.html

On 7/24/23 16:43, Steve Cohen wrote:
Hello,

I am running Nutch 1.19 and I am getting the following error:

2023-07-21 14:55:38,013 ERROR o.a.n.p.t.TikaParser [parse-0] Error parsing file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67
org.apache.tika.exception.TikaException: Failed to parse an email message
         at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:110) ~[?:?]
         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-core-2.3.0.jar:2.3.0]
         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) ~[?:?]
         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) ~[?:?]
         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) ~[apache-nutch-1.19.jar:?]
         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) ~[apache-nutch-1.19.jar:?]
         at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
         at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.james.mime4j.io.MaxHeaderLimitException: Maximum header limit (1000) exceeded
         at org.apache.james.mime4j.stream.MimeEntity.nextField(MimeEntity.java:254) ~[?:?]
         at org.apache.james.mime4j.stream.MimeEntity.advance(MimeEntity.java:296) ~[?:?]
         at org.apache.james.mime4j.stream.MimeTokenStream.next(MimeTokenStream.java:374) ~[?:?]
         at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:176) ~[?:?]
         at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:98) ~[?:?]


Is there a way to increase the header limit in nutch-site.xml or elsewhere?
I looked through nutch-default.xml and didn't see such a property, but
maybe I missed it?

Thanks,
Steve Cohen
