Hello Sebastian,

I forgot to reply. I ended up putting the entry in parse-plugins.xml and
that fixed my issue.
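
For anyone who finds this thread later, the entry sits in conf/parse-plugins.xml
roughly as sketched here. The message/rfc822 block is the one Sebastian suggested
(quoted below). The surrounding elements are an abbreviated assumption about the
stock file layout, so compare them with your own copy rather than pasting this
verbatim.

```xml
<parse-plugins>

  <!-- Assumed to be present already in the stock file: anything not
       matched by a more specific entry goes to parse-tika. -->
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- Added entry from this thread: files detected as message/rfc822
       are handed to the HTML parser instead of Tika's RFC822Parser. -->
  <mimeType name="message/rfc822">
    <plugin id="parse-html" />
  </mimeType>

  <aliases>
    <!-- Assumption: the plugin id used above needs a matching alias,
         which the stock file normally defines already. -->
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
  </aliases>

</parse-plugins>
```

This also assumes parse-html is enabled through plugin.includes, which should be
the case with the default configuration.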

Thanks,
Steve Cohen

On Wed, Jul 26, 2023 at 1:39 PM Sebastian Nagel
<wastl.na...@googlemail.com.invalid> wrote:

> Hi Steve,
>
>  > copy and pasted an email thread together and there are a few
>  > weird characters in it.
>
> Ok. That explains the error.
>
>
>  > there is a way to tell nutch
>  > to choose some other parser.
>
> Yes, that's possible. In the conf/ folder there is a file
> parse-plugins.xml - if you add the following lines
>
>         <mimeType name="message/rfc822">
>                 <plugin id="parse-html" />
>         </mimeType>
>
> files of MIME type message/rfc822 are parsed using the
> HTML parser.
>
>
>  > there are a few weird characters in it
>
> Might be that the parse-html parser also chokes on that content.
>
>
> Another option could be to manipulate the tika-mimetypes.xml to
> override the MIME detection and forward those files to some
> custom MIME type. But that might not be that easy.
>
>
> Best,
> Sebastian
>
>
> On 7/26/23 18:08, Steve Cohen wrote:
> > Thanks for the reply.
> >
> > I can't share the file but it isn't in eml format. It looks like someone
> > copy and pasted an email thread together and there are a few
> > weird characters in it. I have no problem using less to view it. I am
> > wondering why it is parsing it as email and if there is a way to tell
> > nutch to choose some other parser. I have over 500 of the errors so I
> > don't want to skip them.
> >
> > Thanks,
> > Steve Cohen
> >
> > On Wed, Jul 26, 2023 at 10:36 AM Sebastian Nagel
> > <wastl.na...@googlemail.com.invalid> wrote:
> >
> >> Hi Steve,
> >>
> >>   > file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67
> >>
> >> what does the file contain? An .eml file (following RFC822)?
> >> Would it be possible to share this file or at least a chunk large
> >> enough to reproduce the issue?
> >>
> >> The error message might indicate that there are too many headers
> >> - 1000 is the limit for the max. header count, see [1].
> >> But then it's hardly an email message but some other file format
> >> erroneously detected as email.
> >>
> >> If in doubt, and if parsing this file is mandatory, you could also
> >> post the error on the Tika user mailing list, see [2].
> >>
> >> Best,
> >> Sebastian
> >>
> >> [1] https://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/stream/MimeConfig.html#setMaxHeaderCount(int)
> >> [2] https://tika.apache.org/mail-lists.html
> >>
> >> On 7/24/23 16:43, Steve Cohen wrote:
> >>> Hello,
> >>>
> >>> I am running nutch 1.19 and I am getting the following error:
> >>>
> >>> 2023-07-21 14:55:38,013 ERROR o.a.n.p.t.TikaParser [parse-0] Error parsing file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67
> >>> org.apache.tika.exception.TikaException: Failed to parse an email message
> >>>           at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:110) ~[?:?]
> >>>           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-core-2.3.0.jar:2.3.0]
> >>>           at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) ~[?:?]
> >>>           at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) ~[?:?]
> >>>           at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) ~[apache-nutch-1.19.jar:?]
> >>>           at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) ~[apache-nutch-1.19.jar:?]
> >>>           at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
> >>>           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
> >>>           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
> >>>           at java.lang.Thread.run(Thread.java:829) ~[?:?]
> >>> Caused by: org.apache.james.mime4j.io.MaxHeaderLimitException: Maximum header limit (1000) exceeded
> >>>           at org.apache.james.mime4j.stream.MimeEntity.nextField(MimeEntity.java:254) ~[?:?]
> >>>           at org.apache.james.mime4j.stream.MimeEntity.advance(MimeEntity.java:296) ~[?:?]
> >>>           at org.apache.james.mime4j.stream.MimeTokenStream.next(MimeTokenStream.java:374) ~[?:?]
> >>>           at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:176) ~[?:?]
> >>>           at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:98) ~[?:?]
> >>>
> >>>
> >>> Is there a way to increase the header limit in nutch-site.xml or
> >>> elsewhere? I looked through nutch-default.xml and didn't see the
> >>> property, but maybe I missed it?
> >>>
> >>> Thanks,
> >>> Steve Cohen
> >>>
> >>
> >
>
