[ 
https://issues.apache.org/jira/browse/CONNECTORS-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448749#comment-17448749
 ] 

Karl Wright commented on CONNECTORS-1681:
-----------------------------------------

[~julienFL], the database record just needs to exclude anything that is not 
valid UTF-8.  You do not need to limit it to ASCII.  If you read the 
description, you will note that the error message says as much: it reports 
an invalid UTF-8 byte sequence, and since the input is a Java string, it 
must contain code points that cannot be represented as UTF-8.
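
To make the distinction concrete, here is a minimal sketch of a cleanup that 
keeps non-ASCII text and drops only what the database cannot store.  The 
class and method names (ActivityTextSanitizer, sanitize) are hypothetical and 
are not part of ManifoldCF.  The sketch assumes the offending content is the 
NUL character shown in the log (0x00), which PostgreSQL does not accept in 
text values, plus any unpaired surrogates, which have no valid UTF-8 encoding.

```java
// Hypothetical helper, not ManifoldCF code: strips characters that cannot be
// stored in a PostgreSQL UTF8 text column while leaving other text untouched.
final class ActivityTextSanitizer {
    private ActivityTextSanitizer() {}

    static String sanitize(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\u0000') {
                // NUL: the 0x00 byte reported in the SQLException.
                i++;
            } else if (Character.isHighSurrogate(c)) {
                if (i + 1 < input.length()
                        && Character.isLowSurrogate(input.charAt(i + 1))) {
                    // Well-formed surrogate pair: keep both halves.
                    out.append(c).append(input.charAt(i + 1));
                    i += 2;
                } else {
                    // Unpaired high surrogate: drop it.
                    i++;
                }
            } else if (Character.isLowSurrogate(c)) {
                // Stray low surrogate with no preceding high surrogate: drop it.
                i++;
            } else {
                // Everything else, including non-ASCII characters, is kept.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Under these assumptions the exception description would be passed through 
sanitize() before the activity is recorded, so accented or non-Latin text in 
the message survives and only the unstorable characters are removed.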


> TikaServiceRmeta: recordActivity can cause Database exception
> -------------------------------------------------------------
>
>                 Key: CONNECTORS-1681
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1681
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Tika service connector
>    Affects Versions: ManifoldCF 2.20
>            Reporter: Julien Massiera
>            Assignee: Julien Massiera
>            Priority: Major
>             Fix For: ManifoldCF 2.21
>
>
> Some files containing non-ASCII characters can cause Tika to throw an 
> exception describing the parsing problem. 
> The TikaServiceRmeta connector creates an activity record for every Tika 
> exception and includes the exception description, which in those cases 
> contains the non-ASCII character. This causes an SQL exception when MCF 
> tries to insert the activity record into Postgres:
> {code:java}
> ERROR 2021-11-24T13:37:00,121 (Worker thread '41') - 
> MCF|MCF-agent|apache.manifoldcf.crawlerthreads|Worker thread aborting and 
> restarting due to database connection reset: Database exception: SQLException 
> doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database 
> exception: SQLException doing query (22021): ERROR: invalid byte sequence for 
> encoding "UTF8": 0x00 {code}
> To avoid this, we need to remove any non-ASCII characters from the 
> exception description before recording the activity.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
