Hello Sebastian, I forgot to reply. I ended up putting the entry in parse-plugins.xml and that fixed my issue.
Thanks,
Steve Cohen

On Wed, Jul 26, 2023 at 1:39 PM Sebastian Nagel <wastl.na...@googlemail.com.invalid> wrote:

> Hi Steve,
>
> > copy and pasted an email thread together and there are a few
> > weird characters in it.
>
> Ok. That explains the error.
>
> > there is a way to tell nutch
> > to choose some other parser.
>
> Yes, that's possible. In the conf/ folder there is a file
> parse-plugins.xml - if you add the following lines
>
> <mimeType name="message/rfc822">
>   <plugin id="parse-html" />
> </mimeType>
>
> files of MIME type message/rfc822 are parsed using the
> HTML parser.
>
> > there are a few weird characters in it
>
> Might be that the parse-html parser also chokes on that content.
>
> Another option could be to manipulate the tika-mimetypes.xml to
> override the MIME detection and forward those files to some
> custom MIME type. But that might not be that easy.
>
> Best,
> Sebastian
>
> On 7/26/23 18:08, Steve Cohen wrote:
> > Thanks for the reply.
> >
> > I can't share the file but it isn't in eml format. It looks like someone
> > copy and pasted an email thread together and there are a few
> > weird characters in it. I have no problem using less to view it. I am
> > wondering why it is parsing it as email and if there is a way to tell nutch
> > to choose some other parser. I have over 500 of the errors so I don't want
> > to skip them.
> >
> > Thanks,
> > Steve Cohen
> >
> > On Wed, Jul 26, 2023 at 10:36 AM Sebastian Nagel
> > <wastl.na...@googlemail.com.invalid> wrote:
> >
> >> Hi Steve,
> >>
> >> > file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67
> >>
> >> what does the file contain? An .eml file (following RFC822)?
> >> Would it be possible to share this file or at least a chunk large
> >> enough to reproduce the issue?
> >>
> >> The error message might indicate that there are too many headers
> >> - 1000 is the limit for the max. header count, see [1].
> >> But then it's hardly a email message but some other file format
> >> erroneously detected as email.
> >>
> >> In doubt, if parsing this file is mandatory, you could also post
> >> the error on the Tika user mailing list, see [2].
> >>
> >> Best,
> >> Sebastian
> >>
> >> [1] https://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/stream/MimeConfig.html#setMaxHeaderCount(int)
> >> [2] https://tika.apache.org/mail-lists.html
> >>
> >> On 7/24/23 16:43, Steve Cohen wrote:
> >>> Hello,
> >>>
> >>> I am running nutch 1.19 and I am getting the following error:
> >>>
> >>> 2023-07-21 14:55:38,013 ERROR o.a.n.p.t.TikaParser [parse-0] Error parsing file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67
> >>> org.apache.tika.exception.TikaException: Failed to parse an email message
> >>>     at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:110) ~[?:?]
> >>>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-core-2.3.0.jar:2.3.0]
> >>>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) ~[?:?]
> >>>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) ~[?:?]
> >>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) ~[apache-nutch-1.19.jar:?]
> >>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) ~[apache-nutch-1.19.jar:?]
> >>>     at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
> >>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
> >>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
> >>>     at java.lang.Thread.run(Thread.java:829) ~[?:?]
> >>> Caused by: org.apache.james.mime4j.io.MaxHeaderLimitException: Maximum header limit (1000) exceeded
> >>>     at org.apache.james.mime4j.stream.MimeEntity.nextField(MimeEntity.java:254) ~[?:?]
> >>>     at org.apache.james.mime4j.stream.MimeEntity.advance(MimeEntity.java:296) ~[?:?]
> >>>     at org.apache.james.mime4j.stream.MimeTokenStream.next(MimeTokenStream.java:374) ~[?:?]
> >>>     at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:176) ~[?:?]
> >>>     at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:98) ~[?:?]
> >>>
> >>> Is there a way to increase the header limit in nutch-site.xml or elsewhere?
> >>> I looked through the nutch-defaults.xml and didn't see the property but
> >>> maybe I missed it?
> >>>
> >>> Thanks,
> >>> Steve Cohen
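
The thread above attributes the failure to the 1000-field header limit being hit on files that are not real email messages. A rough way to check a suspect file for that pattern is to count header-like lines before the first blank line. The sketch below is only an approximation for triage: the function name `count_header_like_lines` and the regular expression are illustrative assumptions, and it does not reproduce how mime4j or Tika actually tokenize a message.

```python
import re

# Loose approximation of the start of an RFC 822 header field:
# a name made of printable ASCII characters other than space and
# colon, immediately followed by a colon. (Assumption for this sketch.)
FIELD_START = re.compile(r"^[!-9;-~]+:")


def count_header_like_lines(text: str) -> int:
    """Count lines that look like header fields before the first blank line."""
    count = 0
    for line in text.splitlines():
        if line.strip() == "":
            break  # a blank line ends the header block
        if line[0] in " \t":
            continue  # folded continuation of the previous field
        if FIELD_START.match(line):
            count += 1
    return count
```

To try it on a crawled file, read the content with something like `open(path, errors="replace").read()` and pass it in. A count near or above 1000 would be consistent with the `MaxHeaderLimitException` in the trace, though a lower count does not rule it out.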