[
https://issues.apache.org/jira/browse/CONNECTORS-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448749#comment-17448749
]
Karl Wright commented on CONNECTORS-1681:
-----------------------------------------
[~julienFL], the database record just needs to exclude anything that is not
valid UTF-8. You do not need to restrict it to ASCII. If you read the
description, you will note that the error message says as much: it reports an
invalid UTF-8 byte sequence, and since the input is a Java string, it
must contain codepoints that cannot be represented as UTF-8.
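
As a minimal sketch of that kind of filtering (the class and method names
ActivityTextSanitizer and sanitize are hypothetical and not part of
ManifoldCF), the description could be cleaned before it is recorded. The
sketch assumes two things need removing: U+0000, which is the byte reported
in the log below and which PostgreSQL does not accept in text values, and
unpaired surrogates, which have no UTF-8 encoding. All other characters,
including non-ASCII ones, pass through unchanged.

```java
// Hypothetical helper, not ManifoldCF code: strips characters that cannot be
// stored in a PostgreSQL UTF8 text column while keeping non-ASCII text.
final class ActivityTextSanitizer {
    private ActivityTextSanitizer() {}

    static String sanitize(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // codePointAt returns the lone surrogate itself when it is unpaired
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == 0) {
                continue; // U+0000 is rejected by PostgreSQL text values
            }
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue; // unpaired surrogate: not representable as UTF-8
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }
}
```

The connector would call something like ActivityTextSanitizer.sanitize(...)
on the exception description just before recording the activity; where
exactly that call belongs is left open here.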
> TikaServiceRmeta: recordActivity can cause Database exception
> -------------------------------------------------------------
>
> Key: CONNECTORS-1681
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1681
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika service connector
> Affects Versions: ManifoldCF 2.20
> Reporter: Julien Massiera
> Assignee: Julien Massiera
> Priority: Major
> Fix For: ManifoldCF 2.21
>
>
> Some files containing non-ASCII characters can cause Tika to throw an
> exception describing the parsing problem.
> Because the TikaServiceRmeta connector creates an activity record containing
> the description of any Tika exception (which, in those cases, includes the
> non-ASCII characters), an SQL exception occurs when MCF tries to insert the
> activity record into Postgres:
> {code:java}
> ERROR 2021-11-24T13:37:00,121 (Worker thread '41') -
> MCF|MCF-agent|apache.manifoldcf.crawlerthreads|Worker thread aborting and
> restarting due to database connection reset: Database exception: SQLException
> doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
> exception: SQLException doing query (22021): ERROR: invalid byte sequence for
> encoding "UTF8": 0x00 {code}
> To avoid this, we need to remove any non-ASCII characters from the exception
> description before recording the activity.
>