[ 
https://issues.apache.org/jira/browse/CONNECTORS-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Massiera updated CONNECTORS-1681:
----------------------------------------
    Description: 
Some files containing non-UTF-8 characters can cause Tika to throw an 
exception describing the parsing problem. 
Because the TikaServiceRmeta connector creates an activity record for every 
Tika exception and includes the exception description (which in those cases 
contains the non-UTF-8 character), an SQL exception occurs when MCF tries to 
insert the activity record into the database:
{code:java}
ERROR 2021-11-24T13:37:00,121 (Worker thread '41') - 
MCF|MCF-agent|apache.manifoldcf.crawlerthreads|Worker thread aborting and 
restarting due to database connection reset: Database exception: SQLException 
doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: 
SQLException doing query (22021): ERROR: invalid byte sequence for encoding 
"UTF8": 0x00 {code}
To avoid this, we need to remove those problematic characters from the 
exception description before recording the activity.
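
The clean-up step described above could be sketched roughly as follows. This is only an illustration, not the fix committed for this issue: the class name `ActivityDescriptionSanitizer` and the method `sanitize` are hypothetical, and the sketch assumes the offending character is the NUL character (0x00) shown in the log, which PostgreSQL rejects in text values. It also drops other non-whitespace control characters and unpaired surrogates, which is an assumption about what else might be worth filtering.

```java
// Hypothetical helper, not part of ManifoldCF: strips characters that a
// database text column may refuse to store.
final class ActivityDescriptionSanitizer {

  private ActivityDescriptionSanitizer() {
  }

  /**
   * Returns the description without NUL, other control characters
   * (tab, line feed and carriage return are kept) and unpaired surrogates.
   * A null input is returned unchanged.
   */
  static String sanitize(String description) {
    if (description == null) {
      return null;
    }
    StringBuilder cleaned = new StringBuilder(description.length());
    int i = 0;
    while (i < description.length()) {
      int cp = description.codePointAt(i);
      i += Character.charCount(cp);
      boolean allowedWhitespace = cp == '\t' || cp == '\n' || cp == '\r';
      boolean control = Character.isISOControl(cp) && !allowedWhitespace;
      // codePointAt returns the surrogate itself when it is not part of a pair
      boolean loneSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
      if (!control && !loneSurrogate) {
        cleaned.appendCodePoint(cp);
      }
    }
    return cleaned.toString();
  }
}
```

In the connector, the idea would be to pass the exception description through such a method and hand only the cleaned string to the activity recording call.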

 

  was:
Some files containing non-ASCII characters can cause Tika to throw an 
exception describing the parsing problem. 
Because the TikaServiceRmeta connector creates an activity record for every 
Tika exception and includes the exception description (which in those cases 
contains the non-ASCII character), an SQL exception occurs when MCF tries to 
insert the activity record into Postgres:
{code:java}
ERROR 2021-11-24T13:37:00,121 (Worker thread '41') - 
MCF|MCF-agent|apache.manifoldcf.crawlerthreads|Worker thread aborting and 
restarting due to database connection reset: Database exception: SQLException 
doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: 
SQLException doing query (22021): ERROR: invalid byte sequence for encoding 
"UTF8": 0x00 {code}
To avoid this, we need to remove any non-ASCII characters from the exception 
description before recording the activity.

 


> TikaServiceRmeta: recordActivity can cause Database exception
> -------------------------------------------------------------
>
>                 Key: CONNECTORS-1681
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1681
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Tika service connector
>    Affects Versions: ManifoldCF 2.20
>            Reporter: Julien Massiera
>            Assignee: Julien Massiera
>            Priority: Major
>             Fix For: ManifoldCF 2.21
>
>
> Some files containing non-UTF-8 characters can cause Tika to throw an 
> exception describing the parsing problem. 
> Because the TikaServiceRmeta connector creates an activity record for every 
> Tika exception and includes the exception description (which in those cases 
> contains the non-UTF-8 character), an SQL exception occurs when MCF tries to 
> insert the activity record into the database:
> {code:java}
> ERROR 2021-11-24T13:37:00,121 (Worker thread '41') - 
> MCF|MCF-agent|apache.manifoldcf.crawlerthreads|Worker thread aborting and 
> restarting due to database connection reset: Database exception: SQLException 
> doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database 
> exception: SQLException doing query (22021): ERROR: invalid byte sequence for 
> encoding "UTF8": 0x00 {code}
> To avoid this, we need to remove those problematic characters from the 
> exception description before recording the activity.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)