[ https://issues.apache.org/jira/browse/JENA-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425520#comment-17425520 ]
Richard Cyganiak commented on JENA-2179:
----------------------------------------

The ITIL glossaries Turtle file contains the U+FFFD character. No idea how it got there. Clearly a character has been corrupted there at some point, but this is not relevant here. U+FFFD itself is not an illegal Unicode character, and its presence does not make a Turtle file invalid. Encountering data that contains U+FFFD in the wild is not surprising.

A test case (JUnit 4) is attached. It creates a new TDB1 in a temp directory, inserts a triple containing U+FFFD, closes the TDB, re-opens it, and attempts to read the triple. It passes on Jena 4.1, throws an exception as per the issue description on Jena 4.2, and passes after the suggested change to TokenizerText has been applied.

> TDB throws Unicode Replacement Character exception while fetching data
> ----------------------------------------------------------------------
>
>                 Key: JENA-2179
>                 URL: https://issues.apache.org/jira/browse/JENA-2179
>             Project: Apache Jena
>          Issue Type: Bug
>          Components: TDB
>    Affects Versions: Jena 4.2.0
>            Reporter: Holger Knublauch
>            Priority: Major
>         Attachments: TBS4190_Test.java
>
>
> This seems to have been introduced with
> https://issues.apache.org/jira/browse/JENA-2120
> With TDB databases that contain the replacement character in a literal, the
> warnings are reported as Exceptions.
> We have seen this:
> {code:java}
> WARN [http-nio-8083-exec-10] g.e.SimpleDataFetcherExceptionHandler - Exception while fetching data (/resources[0]/turtleSourceCode) : [line: 1, col: 318] Unicode replacement character U+FFFD in string
> org.apache.jena.riot.RiotParseException: [line: 1, col: 318] Unicode replacement character U+FFFD in string
>     at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerRiotParseException.warning(ErrorHandlerFactory.java:367) ~[jena-arq-4.2.0.jar:4.2.0]
>     at org.apache.jena.riot.tokens.TokenizerText.warning(TokenizerText.java:1332) ~[jena-arq-4.2.0.jar:4.2.0]
>     at org.apache.jena.riot.tokens.TokenizerText.readString(TokenizerText.java:768) ~[jena-arq-4.2.0.jar:4.2.0]
>     at org.apache.jena.riot.tokens.TokenizerText.parseToken(TokenizerText.java:238) ~[jena-arq-4.2.0.jar:4.2.0]
>     at org.apache.jena.riot.tokens.TokenizerText.hasNext(TokenizerText.java:89) ~[jena-arq-4.2.0.jar:4.2.0]
>     at org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:119) ~[jena-tdb-4.2.0.jar:4.2.0]
>     at org.apache.jena.tdb.lib.NodeLib.decode(NodeLib.java:118) ~[jena-tdb-4.2.0.jar:4.2.0]
> {code}
> TDB seems to use the fallback error handler, causing an exception to be thrown
> instead of just printing the warning (to the log).
> Richard says he believes a fix would be to change NodecSSE.createTokenizer():
> {code:java}
> return TokenizerText.create()
>         .fromString(string)
>         .errorHandler(ErrorHandlerFactory.errorHandlerDetailed())
>         .build();
> {code}
> Is there any known work-around in 4.2.0? We cannot even query those triples
> from the offending TDBs at the moment.
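
A side note on the comment's claim that U+FFFD is a legal character: the following is a minimal, JDK-only sketch that does not involve Jena and does not reproduce the TDB failure. The class and method names (ReplacementCharCheck, isLegalScalar, roundTripUtf8) and the sample literal are made up for illustration. It checks that U+FFFD is a valid, assigned, non-surrogate code point, and that a string containing it survives a UTF-8 encode/decode round trip unchanged, which is why a warning rather than a hard error seems the proportionate response when reading it back.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharCheck {

    // U+FFFD REPLACEMENT CHARACTER, the character the RIOT tokenizer warns about.
    static final int REPLACEMENT = 0xFFFD;

    // True if cp is a valid, assigned code point that is not a surrogate,
    // i.e. something that may legitimately appear in well-formed text.
    static boolean isLegalScalar(int cp) {
        boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF;
        return Character.isValidCodePoint(cp) && Character.isDefined(cp) && !surrogate;
    }

    // Encode to UTF-8 bytes and decode again.
    static String roundTripUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical literal standing in for the corrupted glossary text.
        String literal = "gloss" + new String(Character.toChars(REPLACEMENT)) + "ry";

        System.out.println("legal scalar: " + isLegalScalar(REPLACEMENT));
        System.out.println("round trip preserved: " + literal.equals(roundTripUtf8(literal)));
    }
}
```

An unpaired surrogate such as U+D800 fails the same check, which is the kind of input that cannot be represented in well-formed UTF-8; U+FFFD is not in that category.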