Thanks for the reply. I can't share the file, but it isn't in .eml format. It looks like someone copied and pasted an email thread together, and there are a few odd characters in it. I have no problem viewing it with less. I am wondering why it is being parsed as email and whether there is a way to tell Nutch to choose a different parser. I have over 500 of these errors, so I don't want to skip the files.
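For anyone hitting the same error, here is a rough way to check whether a file has an unusually long run of header-like lines before its first blank line, which would fit the "Maximum header limit (1000) exceeded" message. This is a minimal Python sketch, not part of Nutch, Tika, or mime4j. The function name and the regular expression are my own, and it only approximates what mime4j actually counts, so treat the number as a hint.

```python
import re

# Approximation of a header field start: a name made of printable
# non-space characters (no colon), optional blanks, then a colon.
FIELD_START = re.compile(r"^[!-9;-~]+[ \t]*:")


def count_header_like_lines(text: str) -> int:
    """Count lines that look like header fields before the first blank line."""
    count = 0
    for line in text.splitlines():
        if line.strip() == "":
            break  # a blank line ends the header block
        if FIELD_START.match(line):
            count += 1
    return count


# Tiny made-up sample: two header fields, one folded continuation line,
# then a blank line followed by body text that must not be counted.
sample = "From: a@example.com\nSubject: test\n continued\n\nBody: not counted\n"
print(count_header_like_lines(sample))
```

To run it on a real file, read it with something like `open(path, encoding="latin-1").read()` so that stray bytes do not raise a decoding error, and pass the result to the function. A count close to or above 1000 would suggest the pasted thread has no blank line where a parser expects the headers to end.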
Thanks,
Steve Cohen

On Wed, Jul 26, 2023 at 10:36 AM Sebastian Nagel <wastl.na...@googlemail.com.invalid> wrote:

> Hi Steve,
>
> > file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67
>
> what does the file contain? An .eml file (following RFC822)?
> Would it be possible to share this file or at least a chunk large
> enough to reproduce the issue?
>
> The error message might indicate that there are too many headers
> - 1000 is the limit for the max. header count, see [1].
> But then it's hardly a email message but some other file format
> erroneously detected as email.
>
> In doubt, if parsing this file is mandatory, you could also post
> the error on the Tika user mailing list, see [2].
>
> Best,
> Sebastian
>
> [1] https://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/stream/MimeConfig.html#setMaxHeaderCount(int)
> [2] https://tika.apache.org/mail-lists.html
>
> On 7/24/23 16:43, Steve Cohen wrote:
> > Hello,
> >
> > I am running nutch 1.19 and I am getting the following error:
> >
> > 2023-07-21 14:55:38,013 ERROR o.a.n.p.t.TikaParser [parse-0] Error parsing file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67
> > org.apache.tika.exception.TikaException: Failed to parse an email message
> >     at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:110) ~[?:?]
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-core-2.3.0.jar:2.3.0]
> >     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) ~[?:?]
> >     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) ~[?:?]
> >     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) ~[apache-nutch-1.19.jar:?]
> >     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) ~[apache-nutch-1.19.jar:?]
> >     at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
> >     at java.lang.Thread.run(Thread.java:829) ~[?:?]
> > Caused by: org.apache.james.mime4j.io.MaxHeaderLimitException: Maximum header limit (1000) exceeded
> >     at org.apache.james.mime4j.stream.MimeEntity.nextField(MimeEntity.java:254) ~[?:?]
> >     at org.apache.james.mime4j.stream.MimeEntity.advance(MimeEntity.java:296) ~[?:?]
> >     at org.apache.james.mime4j.stream.MimeTokenStream.next(MimeTokenStream.java:374) ~[?:?]
> >     at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:176) ~[?:?]
> >     at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:98) ~[?:?]
> >
> > Is there a way to increase the header limit in nutch-site.xml or elsewhere?
> > I looked through the nutch-defaults.xml and didn't see the property but
> > maybe I missed it?
> >
> > Thanks,
> > Steve Cohen