[ https://issues.apache.org/jira/browse/CONNECTORS-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448749#comment-17448749 ]
Karl Wright commented on CONNECTORS-1681:
-----------------------------------------

[~julienFL], the database record just needs to not include any non-UTF-8 strings. You do not need to limit it to just ASCII. If you read the description, you will note that the error message says as much: it says you don't have a valid UTF-8 sequence, and since the input is a Java string, it must contain codepoints that cannot be represented as UTF-8.

> TikaServiceRmeta: recordActivity can cause Database exception
> -------------------------------------------------------------
>
> Key: CONNECTORS-1681
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1681
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika service connector
> Affects Versions: ManifoldCF 2.20
> Reporter: Julien Massiera
> Assignee: Julien Massiera
> Priority: Major
> Fix For: ManifoldCF 2.21
>
>
> Some files containing non-ASCII characters can cause Tika to trigger an
> exception describing the parsing problem.
> As the TikaServiceRmeta connector creates an activity record for any Tika
> exception, containing its description (which therefore contains the non-ASCII
> characters in those cases), it causes an SQL exception when MCF tries to
> insert the activity record in Postgres:
> {code:java}
> ERROR 2021-11-24T13:37:00,121 (Worker thread '41') -
> MCF|MCF-agent|apache.manifoldcf.crawlerthreads|Worker thread aborting and
> restarting due to database connection reset: Database exception: SQLException
> doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
> exception: SQLException doing query (22021): ERROR: invalid byte sequence for
> encoding "UTF8": 0x00 {code}
> So, to avoid this, we need to remove any non-ASCII characters from the
> exception description before recording the activity.
>

--
This message was sent by Atlassian Jira (v8.20.1#820001)
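
For illustration only, here is a minimal Java sketch of the narrower filtering suggested in the comment: rather than stripping everything outside ASCII, drop only what cannot be stored. It assumes the offending characters are U+0000 (which the `0x00` in the logged error suggests, and which PostgreSQL text values cannot contain) and unpaired surrogates (which have no valid UTF-8 encoding). `ActivityTextSanitizer` and `sanitize` are hypothetical names, not part of ManifoldCF, and the sketch does not show where the Tika service connector would call it.

```java
// Hypothetical helper, not ManifoldCF code: a sketch under the assumptions above.
final class ActivityTextSanitizer {
  private ActivityTextSanitizer() {}

  /**
   * Returns a copy of the input without U+0000 and without unpaired
   * surrogates. All other characters, including non-ASCII ones, are kept.
   */
  static String sanitize(String input) {
    if (input == null) {
      return null;
    }
    StringBuilder sb = new StringBuilder(input.length());
    int i = 0;
    while (i < input.length()) {
      char c = input.charAt(i);
      boolean validPair = Character.isHighSurrogate(c)
          && i + 1 < input.length()
          && Character.isLowSurrogate(input.charAt(i + 1));
      if (validPair) {
        // A well-formed surrogate pair encodes one supplementary code point: keep both halves.
        sb.append(c).append(input.charAt(i + 1));
        i += 2;
      } else {
        // Skip NUL and lone surrogates; keep every other character.
        if (c != '\u0000' && !Character.isSurrogate(c)) {
          sb.append(c);
        }
        i++;
      }
    }
    return sb.toString();
  }
}
```

With a filter of this kind, an exception description such as "café" would be recorded unchanged, while an embedded NUL character would be removed before the insert.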