[
https://issues.apache.org/jira/browse/JENA-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425267#comment-17425267
]
Holger Knublauch commented on JENA-2179:
----------------------------------------
The data was populated with a previous version of Jena, from TTL files that
contained this character (the ITIL glossary, which had some incorrect literals
in it). Anyone who has the (old) TopBraid samples installed now has this
corrupted data, and it will crash when they upgrade.
Before the upgrade to Jena 4.2.0 this was not a problem because no such checks
were performed.
The fix mentioned above seems to work. I have just tested it on our product,
where we now use a reflection hack to overwrite the private field NodeLib.nodec
with a patched copy of NodecSSE that differs only in createTokenizer(), as
above.
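The reflection workaround described above can be sketched roughly as follows.
This is a self-contained illustration, not Jena code: Lib, Codec, StrictCodec
and LenientCodec are hypothetical stand-ins for NodeLib, its codec type, the
shipped NodecSSE and the patched copy. It also assumes the real field is
static and non-final, which should be checked against the Jena 4.2.0 sources.

```java
import java.lang.reflect.Field;

public class ReflectionSwapSketch {

    /** Hypothetical stand-in for the codec type held by the library. */
    interface Codec {
        String decode(String stored);
    }

    /** Stand-in for the shipped codec: U+FFFD in a stored string is fatal. */
    static class StrictCodec implements Codec {
        public String decode(String stored) {
            if (stored.indexOf('\uFFFD') >= 0) {
                throw new IllegalStateException(
                        "Unicode replacement character U+FFFD in string");
            }
            return stored;
        }
    }

    /** Stand-in for the patched copy: reports the problem and carries on. */
    static class LenientCodec implements Codec {
        public String decode(String stored) {
            if (stored.indexOf('\uFFFD') >= 0) {
                System.err.println("WARN: replacement character U+FFFD in string");
            }
            return stored;
        }
    }

    /** Stand-in for the library class with a private static codec field. */
    static class Lib {
        private static Codec codec = new StrictCodec();

        static String decode(String stored) {
            return codec.decode(stored);
        }
    }

    /** Overwrites the private static field with the lenient codec. */
    static void installLenientCodec() {
        try {
            Field field = Lib.class.getDeclaredField("codec");
            field.setAccessible(true);            // bypass the private modifier
            field.set(null, new LenientCodec());  // null target: static field
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not replace codec", e);
        }
    }

    public static void main(String[] args) {
        installLenientCodec();
        // After the swap, decoding a string with U+FFFD no longer throws.
        System.out.println(Lib.decode("ITIL \uFFFD glossary"));
    }
}
```

Such a swap has to run once at startup, before the first decode call, and it
depends on the field name staying unchanged between releases, so it is only a
stopgap until a fixed release is available.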
The change for JENA-2120 seems well-motivated and well-intended, as explained in
the comment above TokenizerText.warning: /** Warning - can continue. */
However, when the tokenizer is called from NodeLib, the result is not just a
warning but an exception. Any other warning reported on this path would cause
the same problem.
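To illustrate the point about warnings turning into exceptions, here is a
minimal model of a tokenizer that reports problems through a pluggable handler.
All names (Handler, ThrowingHandler, CollectingHandler, readString) are
hypothetical and only mimic the shape of the situation; they are not the Jena
classes.

```java
import java.util.ArrayList;
import java.util.List;

public class WarningHandlerSketch {

    /** Hypothetical handler interface with a single warning callback. */
    interface Handler {
        void warning(String message, int col);
    }

    /** Models a handler whose warning() throws, ending the parse. */
    static class ThrowingHandler implements Handler {
        public void warning(String message, int col) {
            throw new IllegalStateException("[col: " + col + "] " + message);
        }
    }

    /** Models a handler that records the warning and lets the parse continue. */
    static class CollectingHandler implements Handler {
        final List<String> warnings = new ArrayList<>();

        public void warning(String message, int col) {
            warnings.add("[col: " + col + "] " + message);
        }
    }

    /**
     * Copies the input, reporting each U+FFFD through the handler.
     * Whether the loop continues is decided entirely by the handler.
     */
    static String readString(String input, Handler handler) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch == '\uFFFD') {
                handler.warning(
                        "Unicode replacement character U+FFFD in string", i + 1);
            }
            out.append(ch);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        CollectingHandler collecting = new CollectingHandler();
        readString("bad \uFFFD literal", collecting);
        System.out.println("warnings recorded: " + collecting.warnings.size());

        try {
            readString("bad \uFFFD literal", new ThrowingHandler());
        } catch (IllegalStateException e) {
            System.out.println("parse aborted: " + e.getMessage());
        }
    }
}
```

The tokenizer code is identical in both runs. Only the handler decides whether
"can continue" actually holds, which is why the choice of handler on the
NodeLib path matters.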
Sorry, we are in the middle of a release crunch, so I don't have time to put
together a formal Jena test case.
> TDB throws Unicode Replacement Character exception while fetching data
> ----------------------------------------------------------------------
>
> Key: JENA-2179
> URL: https://issues.apache.org/jira/browse/JENA-2179
> Project: Apache Jena
> Issue Type: Bug
> Components: TDB
> Affects Versions: Jena 4.2.0
> Reporter: Holger Knublauch
> Priority: Major
>
> This seems to have been introduced with
> https://issues.apache.org/jira/browse/JENA-2120
> With TDB databases that contain the replacement character in a literal, the
> warnings are reported as Exceptions. We have seen this:
> {code:java}
> WARN [http-nio-8083-exec-10] g.e.SimpleDataFetcherExceptionHandler - Exception while fetching data (/resources[0]/turtleSourceCode) : [line: 1, col: 318] Unicode replacement character U+FFFD in string
> org.apache.jena.riot.RiotParseException: [line: 1, col: 318] Unicode replacement character U+FFFD in string
>     at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerRiotParseException.warning(ErrorHandlerFactory.java:367) ~[jena-arq-4.2.0.jar:4.2.0]
>     at org.apache.jena.riot.tokens.TokenizerText.warning(TokenizerText.java:1332) ~[jena-arq-4.2.0.jar:4.2.0]
>     at org.apache.jena.riot.tokens.TokenizerText.readString(TokenizerText.java:768) ~[jena-arq-4.2.0.jar:4.2.0]
>     at org.apache.jena.riot.tokens.TokenizerText.parseToken(TokenizerText.java:238) ~[jena-arq-4.2.0.jar:4.2.0]
>     at org.apache.jena.riot.tokens.TokenizerText.hasNext(TokenizerText.java:89) ~[jena-arq-4.2.0.jar:4.2.0]
>     at org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:119) ~[jena-tdb-4.2.0.jar:4.2.0]
>     at org.apache.jena.tdb.lib.NodeLib.decode(NodeLib.java:118) ~[jena-tdb-4.2.0.jar:4.2.0]
> {code}
> TDB seems to use the fallback error handler, which causes an exception to be
> thrown instead of just printing the warning (to the log).
> Richard says he believes a fix would be to change NodecSSE.createTokenizer():
> {code:java}
> return TokenizerText.create()
> .fromString(string)
> .errorHandler(ErrorHandlerFactory.errorHandlerDetailed())
> .build();
> {code}
> Is there any known work-around in 4.2.0? We cannot even query those triples
> from the offending TDBs at the moment.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)