[jira] [Commented] (JENA-2179) TDB throws Unicode Replacement Character exception while fetching data

Andy Seaborne (Jira) Wed, 06 Oct 2021 02:25:05 -0700


    [ https://issues.apache.org/jira/browse/JENA-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424870#comment-17424870 ]


Andy Seaborne commented on JENA-2179:
-------------------------------------

How did an illegal Unicode character get into the database?

U+FFFD indicates a loss of information at best. More likely the input data 
was corrupted by bad encoding, which suggests this is more than an isolated 
problem.
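
For illustration, here is one common way U+FFFD ends up in data: bytes written 
in one encoding are decoded as another. This is a minimal sketch using only the 
JDK; the class and method names are hypothetical and it is not taken from the 
Jena code base or from the reporter's pipeline.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jena.
class ReplacementCharDemo {
    // Lenient decoding: the String constructor silently replaces malformed
    // UTF-8 input with U+FFFD instead of reporting an error.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" written as ISO-8859-1: the final byte 0xE9 is not valid UTF-8.
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        String decoded = decodeLenient(latin1);
        // The original character is lost; only the replacement character remains.
        System.out.println(decoded.indexOf('\uFFFD') >= 0);
    }
}
```

Once the replacement has happened, the original character cannot be recovered 
from the stored string.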

TDB2 will not have this issue because it does not use NodecSSE. It should 
handle such data, but it cannot repair it.
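
Since the database cannot repair data, the place to catch this is before 
loading. The following is a sketch under the assumption that the input arrives 
as bytes claimed to be UTF-8; the class and method names are hypothetical, and 
it uses only the JDK rather than any Jena API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical pre-load check, not part of Jena.
class StrictUtf8 {
    // Strict decoding: malformed input raises CharacterCodingException
    // instead of being silently replaced by U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Detects data that was already damaged by an earlier lenient decode.
    static boolean containsReplacementChar(String s) {
        return s.indexOf('\uFFFD') >= 0;
    }
}
```

The first method rejects badly encoded bytes up front; the second only flags 
strings where the information has already been lost.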

 
Because of caching of the node table, it is difficult to reproduce. Could you 
please provide a test case for TDB1, and also at the component level since 
you've already looked at those components? TokenizerText (TestTokenizer) and 
NodecSSE (TestCodec) both have test suites.

 
bq. seems to have been introduced

So before that it was OK? Or a different error occurred?

bq. believes a fix would be

Have you tried it?


> TDB throws Unicode Replacement Character exception while fetching data
> ----------------------------------------------------------------------
>
>                 Key: JENA-2179
>                 URL: https://issues.apache.org/jira/browse/JENA-2179
>             Project: Apache Jena
>          Issue Type: Bug
>          Components: TDB
>    Affects Versions: Jena 4.2.0
>            Reporter: Holger Knublauch
>            Priority: Major
>
> This seems to have been introduced with 
> https://issues.apache.org/jira/browse/JENA-2120
> With TDB databases that contain the replacement character in a literal, the 
> warnings are reported as Exceptions. We have seen this:
> {code:java}
> WARN  [http-nio-8083-exec-10] g.e.SimpleDataFetcherExceptionHandler - Exception while fetching data (/resources[0]/turtleSourceCode) : [line: 1, col: 318] Unicode replacement character U+FFFD in string
> org.apache.jena.riot.RiotParseException: [line: 1, col: 318] Unicode replacement character U+FFFD in string
>       at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerRiotParseException.warning(ErrorHandlerFactory.java:367) ~[jena-arq-4.2.0.jar:4.2.0]
>       at org.apache.jena.riot.tokens.TokenizerText.warning(TokenizerText.java:1332) ~[jena-arq-4.2.0.jar:4.2.0]
>       at org.apache.jena.riot.tokens.TokenizerText.readString(TokenizerText.java:768) ~[jena-arq-4.2.0.jar:4.2.0]
>       at org.apache.jena.riot.tokens.TokenizerText.parseToken(TokenizerText.java:238) ~[jena-arq-4.2.0.jar:4.2.0]
>       at org.apache.jena.riot.tokens.TokenizerText.hasNext(TokenizerText.java:89) ~[jena-arq-4.2.0.jar:4.2.0]
>       at org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:119) ~[jena-tdb-4.2.0.jar:4.2.0]
>       at org.apache.jena.tdb.lib.NodeLib.decode(NodeLib.java:118) ~[jena-tdb-4.2.0.jar:4.2.0]
> {code}
> TDB seems to use the fallback error handler causing an exception to be thrown 
> instead of just printing the warning (to the log).
> Richard says he believes a fix would be to change NodecSSE.createTokenizer():
> {code:java}
> return TokenizerText.create()
>     .fromString(string)
>     .errorHandler(ErrorHandlerFactory.errorHandlerDetailed())
>     .build();
> {code}
> Is there any known work-around in 4.2.0? We cannot even query those triples 
> from the offending TDBs at the moment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
