Thanks for the reply.

I can't share the file, but it isn't in eml format. It looks like someone
copied and pasted an email thread together, and there are a few
weird characters in it. I have no problem viewing it with less. I am
wondering why it is being parsed as email and whether there is a way to tell
Nutch to choose some other parser. I have over 500 of these errors, so I don't
want to skip them.
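
In case it helps narrow things down, here is a rough sketch for checking the "too many headers" theory from the quoted reply on one of the failing files. It only approximates RFC 822 header syntax (lines shaped like `Name: value` before the first blank line) and is not how mime4j counts fields internally; the function name is made up for illustration.

```python
import re

# Approximate RFC 822 field line: a name made of printable ASCII
# characters other than space and colon, followed by a colon.
FIELD_RE = re.compile(r"^[!-9;-~]+:")


def count_leading_header_fields(lines):
    """Count header-like lines before the first blank line.

    Assumption: this is a heuristic for eyeballing a file, not a
    reimplementation of the mime4j parser.
    """
    count = 0
    for line in lines:
        stripped = line.rstrip("\r\n")
        if stripped == "":
            break  # a blank line ends the header block in RFC 822
        if stripped[0] in " \t":
            continue  # folded continuation of the previous field
        if FIELD_RE.match(stripped):
            count += 1
    return count
```

Passing an open file handle (for example one opened with `errors="replace"` to tolerate the odd characters) returns the count. A result near or above 1000 would at least be consistent with the MaxHeaderLimitException, though it would not prove that this is what the parser sees.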

Thanks,
Steve Cohen

On Wed, Jul 26, 2023 at 10:36 AM Sebastian Nagel
<wastl.na...@googlemail.com.invalid> wrote:

> Hi Steve,
>
> > file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67
>
> what does the file contain? An .eml file (following RFC822)?
> Would it be possible to share this file or at least a chunk large
> enough to reproduce the issue?
>
> The error message might indicate that there are too many headers
> - 1000 is the limit for the max. header count, see [1].
> But then it's hardly an email message but some other file format
> erroneously detected as email.
>
> If in doubt, and if parsing this file is mandatory, you could also post
> the error on the Tika user mailing list, see [2].
>
> Best,
> Sebastian
>
> [1] https://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/stream/MimeConfig.html#setMaxHeaderCount(int)
> [2] https://tika.apache.org/mail-lists.html
>
> On 7/24/23 16:43, Steve Cohen wrote:
> > Hello,
> >
> > I am running nutch 1.19 and I am getting the following error:
> >
> > 2023-07-21 14:55:38,013 ERROR o.a.n.p.t.TikaParser [parse-0] Error parsing file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67
> > org.apache.tika.exception.TikaException: Failed to parse an email message
> >          at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:110) ~[?:?]
> >          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-core-2.3.0.jar:2.3.0]
> >          at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) ~[?:?]
> >          at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) ~[?:?]
> >          at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) ~[apache-nutch-1.19.jar:?]
> >          at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) ~[apache-nutch-1.19.jar:?]
> >          at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
> >          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
> >          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
> >          at java.lang.Thread.run(Thread.java:829) ~[?:?]
> > Caused by: org.apache.james.mime4j.io.MaxHeaderLimitException: Maximum header limit (1000) exceeded
> >          at org.apache.james.mime4j.stream.MimeEntity.nextField(MimeEntity.java:254) ~[?:?]
> >          at org.apache.james.mime4j.stream.MimeEntity.advance(MimeEntity.java:296) ~[?:?]
> >          at org.apache.james.mime4j.stream.MimeTokenStream.next(MimeTokenStream.java:374) ~[?:?]
> >          at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:176) ~[?:?]
> >          at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:98) ~[?:?]
> >
> >
> > Is there a way to increase the header limit in nutch-site.xml or elsewhere?
> > I looked through nutch-default.xml and didn't see the property, but
> > maybe I missed it?
> >
> > Thanks,
> > Steve Cohen
> >
>
