[ https://issues.apache.org/jira/browse/JENA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234888#comment-13234888 ]

Andy Seaborne edited comment on JENA-225 at 3/21/12 7:32 PM:
-------------------------------------------------------------

This issue is not related to transactions per se.  Normally, node caching hides 
the fact that the DB has been corrupted by illegal UTF-8.

The transaction system just happens to highlight the problem, because it works 
without the high-level node caches that make the actions idempotent.

The attached file shows it can happen for a raw storage dataset.  The code 
resets the system storage cache to remove all node table caches.

Also, in the code snippet, print out the size of the byte buffer after 'encode' 
and it will show that it is short.
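
For example, something along these lines (a sketch based on the attached 
ReportBadUnicode1.java, assuming the same imports and context as that reproducer; 
exact output may vary by TDB version):

    // A lone high surrogate -- not a legal Unicode string.
    Node literal = Node.createLiteral("Hello \uDAE0 World");
    ByteBuffer bb = NodeLib.encode(literal);
    // Prints fewer bytes than the lexical form needs.
    System.out.println("Encoded bytes: " + bb.remaining());
    // Decoding the truncated bytes is what produces the "Broken token" error.
    NodeLib.decode(bb);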

The problem is in the encoding of chars to bytes.  The java.nio.charset encoder 
needs "onMalformedInput(CodingErrorAction.REPLACE)" to be set and it isn't.  
With that set, the bad Unicode codepoint (a high surrogate without a following 
low surrogate to make a surrogate pair) is replaced with a '?' character -- the 
standard Java charset replacement behaviour.
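
To illustrate the java.nio.charset behaviour on its own (a standalone sketch, 
not the TDB code path; the byte count assumes the example string from the 
report):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CodingErrorAction;

    public class EncodeWithReplace {
        public static void main(String[] args) throws Exception {
            CharsetEncoder enc = Charset.forName("UTF-8").newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)        // lone surrogate -> '?'
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
            ByteBuffer bb = enc.encode(CharBuffer.wrap("Hello \uDAE0 World"));
            // 13 bytes: 12 ASCII chars plus one '?' standing in for the bad codepoint.
            System.out.println(bb.remaining());
        }
    }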

A more ambitious fix is to not use the Java encoders/decoders, which are 
sensitive to codepoint legality, and drop down to custom code that only applies 
the UTF-8 encoding rules without checking for legal codepoints.  This would make 
TDB robust, though something else may break when the data leaves the JVM and is 
read in elsewhere, because the data is not legal Unicode.
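
Very roughly, the idea is this (a sketch of the approach only, not the actual 
InStreamUTF8/OutStreamUTF8 code): apply the UTF-8 bit layout to each 16-bit 
char value as-is, so a lone surrogate is written as a three-byte sequence and 
read back unchanged.

    import java.io.IOException;
    import java.io.OutputStream;

    // Sketch: UTF-8 bit-packing applied per Java char, with no surrogate checks,
    // so illegal codepoints round-trip untouched.
    final class RawUtf8 {
        static void write(OutputStream out, char c) throws IOException {
            if (c < 0x80) {                        // 1 byte:  0xxxxxxx
                out.write(c);
            } else if (c < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {                               // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
    }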

Classes InStreamUTF8 and OutStreamUTF8 in ARQ show the encoding algorithm.  
They are slightly slower (a few percent) than the standard Java encoders when 
used in RIOT on large files needing multiple seconds of decoding time.  It will 
only show in TDB on very large literals (100k+ ?).  Normally, lexical forms are 
less than a few hundred bytes and the difference is not measurable (the custom 
codec may even be faster due to lower startup costs).  It is well below the 
rest of the database processing costs.
                
> TDB datasets can be corrupted by performing certain operations within a 
> transaction 
> ------------------------------------------------------------------------------------
>
>                 Key: JENA-225
>                 URL: https://issues.apache.org/jira/browse/JENA-225
>             Project: Apache Jena
>          Issue Type: Bug
>    Affects Versions: TDB 0.9.0
>         Environment: jena-tdb-0.9.0-incubating
>            Reporter: Sam Tunnicliffe
>         Attachments: ReportBadUnicode1.java
>
>
> In a web application, we read some triples in an HTTP POST, using a LangTurtle 
> instance and a tokenizer obtained from TokenizerFactory.makeTokenizerUTF8. 
> We then write the parsed Triples back out (to temporary storage) using 
> OutputLangUtils.write. At some later time, these Triples are then re-read, 
> again using a Tokenizer from TokenizerFactory.makeTokenizerUTF8, before being 
> inserted into a TDB dataset. 
> We have found it possible for the input data to contain character strings 
> which pass through the various parsers/serializers but which cause TDB's 
> transaction layer to error in such a way as to make recovery from journals 
> ineffective. 
> Eliminating transactions from the code path enables the database to be 
> updated successfully.
> The stacktrace from TDB looks like this: 
> org.openjena.riot.RiotParseException: [line: 1, col: 2 ] Broken token: Hello 
>       at org.openjena.riot.tokens.TokenizerText.exception(TokenizerText.java:1209)
>       at org.openjena.riot.tokens.TokenizerText.readString(TokenizerText.java:620)
>       at org.openjena.riot.tokens.TokenizerText.parseToken(TokenizerText.java:248)
>       at org.openjena.riot.tokens.TokenizerText.hasNext(TokenizerText.java:112)
>       at com.hp.hpl.jena.tdb.nodetable.NodecSSE.decode(NodecSSE.java:105)
>       at com.hp.hpl.jena.tdb.lib.NodeLib.decode(NodeLib.java:93)
>       at com.hp.hpl.jena.tdb.nodetable.NodeTableNative$2.convert(NodeTableNative.java:234)
>       at com.hp.hpl.jena.tdb.nodetable.NodeTableNative$2.convert(NodeTableNative.java:228)
>       at org.openjena.atlas.iterator.Iter$4.next(Iter.java:301)
>       at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.append(NodeTableTrans.java:188)
>       at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.writeNodeJournal(NodeTableTrans.java:306)
>       at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.commitPrepare(NodeTableTrans.java:266)
>       at com.hp.hpl.jena.tdb.transaction.Transaction.prepare(Transaction.java:131)
>       at com.hp.hpl.jena.tdb.transaction.Transaction.commit(Transaction.java:112)
>       at com.hp.hpl.jena.tdb.transaction.DatasetGraphTxn.commit(DatasetGraphTxn.java:40)
>       at com.hp.hpl.jena.tdb.transaction.DatasetGraphTransaction._commit(DatasetGraphTransaction.java:106)
>       at com.hp.hpl.jena.tdb.migrate.DatasetGraphTrackActive.commit(DatasetGraphTrackActive.java:60)
>       at com.hp.hpl.jena.sparql.core.DatasetImpl.commit(DatasetImpl.java:143)
> At least part of the issue seems to stem from NodecSSE (I know this isn't 
> actual Unicode escaping, but it's derived from the user input we've received). 
> String s = "Hello \uDAE0 World";
> Node literal = Node.createLiteral(s);
> ByteBuffer bb = NodeLib.encode(literal);
> NodeLib.decode(bb);
