[jira] [Commented] (JENA-2179) TDB throws Unicode Replacement Character exception while fetching data

Holger Knublauch (Jira) Wed, 06 Oct 2021 16:17:05 -0700


    [ 
https://issues.apache.org/jira/browse/JENA-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425267#comment-17425267
 ]


Holger Knublauch commented on JENA-2179:
----------------------------------------

The data was loaded with a previous version of Jena, from TTL files that 
contained this character (the ITIL glossary, which had some incorrect literals 
in it). Anyone who has the (old) TopBraid samples installed will have this 
corrupted data, and it will crash when they upgrade.

Before upgrading to Jena 4.2.0 this was OK because no such checks were 
happening.

The fix mentioned above seems to work. I have just tested it on our product, 
where we now use a Reflection hack to overwrite the private field NodeLib.nodec 
with a fixed version of that class that differs only in 
NodecSSE.createTokenizer(), as above.
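
For illustration only, here is a minimal sketch of that kind of Reflection 
hack. It does not use the Jena classes: Codec, StrictCodec, LenientCodec, Lib 
and CodecPatcher are hypothetical stand-ins, with Lib.codec playing the role of 
NodeLib.nodec. It assumes the private static field is not final and that the 
code runs on the classpath, so setAccessible is permitted.

```java
import java.lang.reflect.Field;

// Hypothetical stand-in for the node codec interface; not a Jena type.
interface Codec {
    String decode(String raw);
}

// Stand-in for the shipped codec: rejects the replacement character U+FFFD.
class StrictCodec implements Codec {
    public String decode(String raw) {
        if (raw.indexOf('\uFFFD') >= 0) {
            throw new IllegalStateException(
                "Unicode replacement character U+FFFD in string");
        }
        return raw;
    }
}

// Stand-in for the patched codec: returns the stored string unchanged.
class LenientCodec implements Codec {
    public String decode(String raw) {
        return raw;
    }
}

// Stand-in for the library class that holds the codec in a private field.
class Lib {
    private static Codec codec = new StrictCodec();

    static String decode(String raw) {
        return codec.decode(raw);
    }
}

// Overwrites the private static field with a replacement instance.
class CodecPatcher {
    static void install(Codec replacement) {
        try {
            Field field = Lib.class.getDeclaredField("codec");
            field.setAccessible(true);    // bypass the private modifier
            field.set(null, replacement); // null target: the field is static
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not patch Lib.codec", e);
        }
    }
}
```

Calling CodecPatcher.install(new LenientCodec()) once at startup, before the 
first decode, is enough in this sketch. Such a patch is fragile: it silently 
stops working if the field is renamed or made final.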

The change for JENA-2120 seems well-motivated and well-intended, as explained 
in the comment above TokenizerText.warning:     /** Warning - can continue. */  
However, when the tokenizer is created from NodeLib, it does not just produce a 
warning but throws an Exception. The same would happen for any other warning 
that the tokenizer reports.
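
To make the difference concrete, here is a small self-contained sketch (again 
not Jena code; WarningHandler, ThrowingHandler, CollectingHandler and 
LiteralReader are hypothetical names). The same reader calls warning() in both 
cases, and whether the caller can continue depends only on which handler was 
plugged in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical handler interface; stands in for an error handler's warning hook.
interface WarningHandler {
    void warning(String message);
}

// Turns every warning into an exception, so the caller cannot continue.
class ThrowingHandler implements WarningHandler {
    public void warning(String message) {
        throw new IllegalStateException(message);
    }
}

// Records warnings and lets the caller continue.
class CollectingHandler implements WarningHandler {
    final List<String> warnings = new ArrayList<>();

    public void warning(String message) {
        warnings.add(message);
    }
}

// Reads a literal and reports each U+FFFD it finds through the handler.
class LiteralReader {
    private final WarningHandler handler;

    LiteralReader(WarningHandler handler) {
        this.handler = handler;
    }

    String read(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\uFFFD') {
                handler.warning("[col: " + (i + 1)
                    + "] Unicode replacement character U+FFFD in string");
            }
        }
        return text; // reached only if the handler did not throw
    }
}
```

With ThrowingHandler, read() never returns for a literal containing U+FFFD. 
With CollectingHandler, the literal is returned and the warning is available 
for logging.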

Sorry, we are in the middle of a release crunch, so I don't have time to put 
together a formal Jena test case.


> TDB throws Unicode Replacement Character exception while fetching data
> ----------------------------------------------------------------------
>
>                 Key: JENA-2179
>                 URL: https://issues.apache.org/jira/browse/JENA-2179
>             Project: Apache Jena
>          Issue Type: Bug
>          Components: TDB
>    Affects Versions: Jena 4.2.0
>            Reporter: Holger Knublauch
>            Priority: Major
>
> This seems to have been introduced with 
> https://issues.apache.org/jira/browse/JENA-2120
> With TDB databases that contain the replacement character in a literal, the 
> warnings are reported as Exceptions. We have seen this:
> {code:java}
> WARN  [http-nio-8083-exec-10] g.e.SimpleDataFetcherExceptionHandler - 
> Exception while fetching data (/resources[0]/turtleSourceCode) : [line: 1, 
> col: 318] Unicode replacement character U+FFFD in string
> org.apache.jena.riot.RiotParseException: [line: 1, col: 318] Unicode 
> replacement character U+FFFD in string
>       at 
> org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerRiotParseException.warning(ErrorHandlerFactory.java:367)
>  ~[jena-arq-4.2.0.jar:4.2.0]
>       at 
> org.apache.jena.riot.tokens.TokenizerText.warning(TokenizerText.java:1332) 
> ~[jena-arq-4.2.0.jar:4.2.0]
>       at 
> org.apache.jena.riot.tokens.TokenizerText.readString(TokenizerText.java:768) 
> ~[jena-arq-4.2.0.jar:4.2.0]
>       at 
> org.apache.jena.riot.tokens.TokenizerText.parseToken(TokenizerText.java:238) 
> ~[jena-arq-4.2.0.jar:4.2.0]
>       at 
> org.apache.jena.riot.tokens.TokenizerText.hasNext(TokenizerText.java:89) 
> ~[jena-arq-4.2.0.jar:4.2.0]
>       at 
> org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:119) 
> ~[jena-tdb-4.2.0.jar:4.2.0]
>       at org.apache.jena.tdb.lib.NodeLib.decode(NodeLib.java:118) 
> ~[jena-tdb-4.2.0.jar:4.2.0]
> {code}
> TDB seems to use the fallback error handler causing an exception to be thrown 
> instead of just printing the warning (to the log).
> Richard says he believes a fix would be to change NodecSSE.createTokenizer():
> {code:java}
> return TokenizerText.create()
>     .fromString(string)
>     .errorHandler(ErrorHandlerFactory.errorHandlerDetailed())
>     .build();
> {code}
> Is there any known work-around in 4.2.0? We cannot even query those triples 
> from the offending TDBs at the moment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
