[jira] [Created] (JENA-2179) TDB throws Unicode Replacement Character exception while fetching data

Holger Knublauch (Jira) Tue, 05 Oct 2021 20:26:08 -0700

Holger Knublauch created JENA-2179:
--------------------------------------

             Summary: TDB throws Unicode Replacement Character exception while fetching data
                 Key: JENA-2179
                 URL: https://issues.apache.org/jira/browse/JENA-2179
             Project: Apache Jena
          Issue Type: Bug
          Components: TDB
    Affects Versions: Jena 4.2.0
            Reporter: Holger Knublauch



This seems to have been introduced with 
https://issues.apache.org/jira/browse/JENA-2120

With TDB databases that contain the Unicode replacement character (U+FFFD) in 
a literal, the tokenizer warning is raised as an exception instead of being 
logged. We have seen this:

{code:java}
WARN  [http-nio-8083-exec-10] g.e.SimpleDataFetcherExceptionHandler - Exception while fetching data (/resources[0]/turtleSourceCode) : [line: 1, col: 318] Unicode replacement character U+FFFD in string
org.apache.jena.riot.RiotParseException: [line: 1, col: 318] Unicode replacement character U+FFFD in string
        at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerRiotParseException.warning(ErrorHandlerFactory.java:367) ~[jena-arq-4.2.0.jar:4.2.0]
        at org.apache.jena.riot.tokens.TokenizerText.warning(TokenizerText.java:1332) ~[jena-arq-4.2.0.jar:4.2.0]
        at org.apache.jena.riot.tokens.TokenizerText.readString(TokenizerText.java:768) ~[jena-arq-4.2.0.jar:4.2.0]
        at org.apache.jena.riot.tokens.TokenizerText.parseToken(TokenizerText.java:238) ~[jena-arq-4.2.0.jar:4.2.0]
        at org.apache.jena.riot.tokens.TokenizerText.hasNext(TokenizerText.java:89) ~[jena-arq-4.2.0.jar:4.2.0]
        at org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:119) ~[jena-tdb-4.2.0.jar:4.2.0]
        at org.apache.jena.tdb.lib.NodeLib.decode(NodeLib.java:118) ~[jena-tdb-4.2.0.jar:4.2.0]
{code}
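
For context, one common way U+FFFD ends up in a literal is decoding bytes that 
are not well-formed UTF-8. The sketch below uses only the JDK; the class and 
method names are made up for illustration and are not part of Jena. It shows 
the substitution happening and a simple check for the character in a string.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharCheck {

    /** Returns the index of the first U+FFFD in the string, or -1 if there is none. */
    static int firstReplacementChar(String lexicalForm) {
        return lexicalForm.indexOf('\uFFFD');
    }

    public static void main(String[] args) {
        // The byte 0xFF never occurs in well-formed UTF-8, so the JDK decoder
        // substitutes the replacement character U+FFFD for it.
        byte[] malformed = { 'a', (byte) 0xFF, 'b' };
        String decoded = new String(malformed, StandardCharsets.UTF_8);

        System.out.println(firstReplacementChar(decoded)); // prints 1
        System.out.println(firstReplacementChar("clean")); // prints -1
    }
}
```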

TDB seems to use the fallback error handler, which throws an exception on a 
warning instead of just writing the warning to the log.
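
To illustrate the difference in behaviour, here is a minimal, JDK-only sketch. 
The Handler interface, both handler classes, and decode() are hypothetical 
stand-ins, not Jena's ErrorHandler API or NodecSSE code. The point is only 
that the same warning either aborts the read or lets it complete, depending 
on which handler the tokenizer is given.

```java
import java.util.ArrayList;
import java.util.List;

public class WarningPolicySketch {

    /** Hypothetical stand-in for a parser error handler; not Jena's ErrorHandler. */
    interface Handler {
        void warning(String message);
    }

    /** Escalates every warning to an exception, the behaviour the stack trace above suggests. */
    static final class ThrowingHandler implements Handler {
        @Override
        public void warning(String message) {
            throw new IllegalStateException(message);
        }
    }

    /** Records the warning and lets decoding continue. */
    static final class CollectingHandler implements Handler {
        final List<String> warnings = new ArrayList<>();

        @Override
        public void warning(String message) {
            warnings.add(message);
        }
    }

    /** Toy decoder: reports U+FFFD as a warning, then returns the stored string unchanged. */
    static String decode(String stored, Handler handler) {
        int index = stored.indexOf('\uFFFD');
        if (index >= 0) {
            handler.warning("Unicode replacement character U+FFFD in string at index " + index);
        }
        return stored;
    }

    public static void main(String[] args) {
        String stored = "a\uFFFDb";

        // With a lenient handler the value is still readable.
        CollectingHandler lenient = new CollectingHandler();
        System.out.println(decode(stored, lenient).length()); // prints 3
        System.out.println(lenient.warnings.size());          // prints 1

        // With a throwing handler the same read fails.
        try {
            decode(stored, new ThrowingHandler());
        } catch (IllegalStateException e) {
            System.out.println("read failed: " + e.getMessage());
        }
    }
}
```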

Richard says he believes a fix would be to change NodecSSE.createTokenizer() to:

{code:java}
return TokenizerText.create()
    .fromString(string)
    .errorHandler(ErrorHandlerFactory.errorHandlerDetailed())
    .build();
{code}

Is there any known workaround in 4.2.0? We cannot even query those triples 
from the affected TDB databases at the moment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
