Hi Steve,

> copy and pasted an email thread together and there are a few
> weird characters in it.

Ok. That explains the error.


> there is a way to tell nutch
> to choose some other parser.

Yes, that's possible. In the conf/ folder there is a file
parse-plugins.xml - if you add the following lines

        <mimeType name="message/rfc822">
                <plugin id="parse-html" />
        </mimeType>

files of MIME type message/rfc822 are parsed using the
HTML parser.


> there are a few weird characters in it

Note that the parse-html parser might also choke on that content.


Another option would be to edit tika-mimetypes.xml to override
the MIME detection and map those files to some custom MIME type.
But that might not be easy.


Best,
Sebastian


On 7/26/23 18:08, Steve Cohen wrote:
Thanks for the reply.

I can't share the file but it isn't in eml format. It looks like someone
copy and pasted an email thread together and there are a few
weird characters in it. I have no problem using less to view it. I am
wondering why it is parsing it as email and if there is a way to tell nutch
to choose some other parser. I have over 500 of the errors so I don't want
to skip them.

Thanks,
Steve Cohen

On Wed, Jul 26, 2023 at 10:36 AM Sebastian Nagel
<wastl.na...@googlemail.com.invalid> wrote:

Hi Steve,

> file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67

what does the file contain? An .eml file (following RFC822)?
Would it be possible to share this file or at least a chunk large
enough to reproduce the issue?

The error message might indicate that there are too many headers
- 1000 is the limit for the max. header count, see [1].
But then it's hardly an email message but rather some other file
format erroneously detected as email.
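
To get a rough idea whether the file really starts with that many
header-like lines, a minimal sketch in Python could look like the
following. It is only an approximation and not the algorithm mime4j
uses; the names count_header_like_lines and HEADER_LINE are made up
for this illustration.

```python
import re

# Rough approximation of an RFC 822 header field line ("Name: value"):
# one or more printable ASCII characters other than space and colon,
# followed by a colon. This is NOT how mime4j parses headers.
HEADER_LINE = re.compile(r"^[!-9;-~]+:")


def count_header_like_lines(text):
    """Count lines that look like header fields before the first blank line."""
    count = 0
    for line in text.splitlines():
        if line.strip() == "":
            break  # a blank line ends the header block
        if HEADER_LINE.match(line):
            count += 1  # folded continuation lines start with whitespace and are skipped
    return count
```

Reading the file with open(path, encoding="utf-8", errors="replace")
and passing its content to the function should tolerate the weird
characters. A count close to or above 1000 would support the
assumption that the header limit is the cause.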

If in doubt, and if parsing this file is mandatory, you could also
post the error on the Tika user mailing list, see [2].

Best,
Sebastian

[1] https://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/stream/MimeConfig.html#setMaxHeaderCount(int)
[2] https://tika.apache.org/mail-lists.html

On 7/24/23 16:43, Steve Cohen wrote:
Hello,

I am running nutch 1.19 and I am getting the following error:

2023-07-21 14:55:38,013 ERROR o.a.n.p.t.TikaParser [parse-0] Error parsing file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67
org.apache.tika.exception.TikaException: Failed to parse an email message
        at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:110) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-core-2.3.0.jar:2.3.0]
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) ~[?:?]
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) ~[?:?]
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) ~[apache-nutch-1.19.jar:?]
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) ~[apache-nutch-1.19.jar:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.james.mime4j.io.MaxHeaderLimitException: Maximum header limit (1000) exceeded
        at org.apache.james.mime4j.stream.MimeEntity.nextField(MimeEntity.java:254) ~[?:?]
        at org.apache.james.mime4j.stream.MimeEntity.advance(MimeEntity.java:296) ~[?:?]
        at org.apache.james.mime4j.stream.MimeTokenStream.next(MimeTokenStream.java:374) ~[?:?]
        at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:176) ~[?:?]
        at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:98) ~[?:?]


Is there a way to increase the header limit in nutch-site.xml or
elsewhere? I looked through nutch-default.xml and didn't see the
property, but maybe I missed it?

Thanks,
Steve Cohen


