[jira] [Commented] (CONNECTORS-1681) TikaServiceRmeta: recordActivity can cause Database exception

2021-11-24 Thread Julien Massiera (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448766#comment-17448766
 ] 

Julien Massiera commented on CONNECTORS-1681:
-

Indeed [~kwri...@metacarta.com], it is the description of my issue that is 
wrong. I decided to remove all non-ASCII chars, and not just the chars that 
are invalid in UTF-8, because the error description that the TikaServiceRmeta 
connector writes to the activity record only needs to be readable and give a 
general idea of what went wrong during the Tika processing phase. So I wanted 
to be sure that the activity record only contains "standard" chars. Even if we 
lose some of them, the accurate exception is still available in the log file. 
Are you OK with that?
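
For illustration, the ASCII-only approach described above could look roughly like the sketch below. The class and method names are hypothetical and are not taken from the ManifoldCF code base; the sketch assumes that "standard" chars means printable ASCII (0x20 to 0x7E).

```java
public class AsciiActivityMessage {

    // Hypothetical helper, not the actual connector code: keeps only
    // printable ASCII (0x20-0x7E) and drops everything else, including
    // the NUL char (0x00) reported in the Postgres error.
    static String toPrintableAscii(String description) {
        if (description == null) {
            return "";
        }
        return description.replaceAll("[^\\x20-\\x7E]", "");
    }

    public static void main(String[] args) {
        // Made-up exception description containing a NUL and an accented char.
        String raw = "Parse error\u0000 in caf\u00e9.pdf";
        System.out.println(toPrintableAscii(raw));
    }
}
```

The trade-off is the one mentioned in the comment: accented and other non-ASCII chars are lost from the activity record, while the full exception remains in the log file.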

> TikaServiceRmeta: recordActivity can cause Database exception
> -
>
> Key: CONNECTORS-1681
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1681
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika service connector
> Affects Versions: ManifoldCF 2.20
> Reporter: Julien Massiera
> Assignee: Julien Massiera
> Priority: Major
> Fix For: ManifoldCF 2.21
>
>
> Some files containing non-ASCII characters can cause Tika to throw an 
> exception describing the parsing problem. 
> The TikaServiceRmeta connector creates an activity record for every Tika 
> exception, and that record contains the exception description (and therefore 
> the non-ASCII chars in those cases). This causes an SQL exception when MCF 
> tries to insert the activity record in Postgres:
> {code:java}
> ERROR 2021-11-24T13:37:00,121 (Worker thread '41') - 
> MCF|MCF-agent|apache.manifoldcf.crawlerthreads|Worker thread aborting and 
> restarting due to database connection reset: Database exception: SQLException 
> doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database 
> exception: SQLException doing query (22021): ERROR: invalid byte sequence for 
> encoding "UTF8": 0x00 {code}
> So to avoid this, we need to remove any non-ASCII chars from the exception 
> description before recording the activity.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (CONNECTORS-1681) TikaServiceRmeta: recordActivity can cause Database exception

2021-11-24 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448749#comment-17448749
 ] 

Karl Wright commented on CONNECTORS-1681:
-

[~julienFL], the database record just needs to not include any non-UTF8 
strings.  You do not need to limit it to just ASCII.  If you read the 
description, you will note that the error message says as much: it says you 
don't have a valid UTF-8 sequence, and since the input is a Java string, it 
must contain codepoints that cannot be represented as UTF-8.
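
As a rough illustration of this narrower filter, the sketch below keeps every char that can be stored, including non-ASCII ones, and drops only two kinds of input: the NUL char (the 0x00 byte shown in the log, which PostgreSQL does not accept in text values) and unpaired surrogates (which have no valid UTF-8 encoding). The class and method names are hypothetical and are not part of ManifoldCF.

```java
public class Utf8SafeActivityMessage {

    // Hypothetical helper, not the actual connector code: removes only the
    // code points that cannot be stored in a UTF-8 Postgres text column.
    static String toUtf8Safe(String description) {
        if (description == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder(description.length());
        int i = 0;
        while (i < description.length()) {
            // codePointAt returns the surrogate itself when it is unpaired,
            // and the full supplementary code point for a valid pair.
            int cp = description.codePointAt(i);
            i += Character.charCount(cp);
            boolean isNul = cp == 0;
            boolean isLoneSurrogate =
                cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE;
            if (!isNul && !isLoneSurrogate) {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up exception description: the accented char is kept,
        // the NUL is removed.
        String raw = "Parse error\u0000 in caf\u00e9.pdf";
        System.out.println(toUtf8Safe(raw));
    }
}
```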




