[
https://issues.apache.org/jira/browse/JENA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234888#comment-13234888
]
Andy Seaborne edited comment on JENA-225 at 3/21/12 7:39 PM:
-------------------------------------------------------------
This issue is not related to transactions per se. Normally, node caching hides
the fact that the DB has been corrupted by illegal UTF-8.
The transaction system just happens to highlight the problem because it works
without the high-level node caches that make the actions idempotent.
The attached file shows it can happen for a raw storage dataset. The code
resets the system storage cache to remove all node table caches.
Also, in the code snippet, print out the size of the byte buffer after 'encode'
and it will show that it is short.
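For illustration, a minimal sketch of the effect (the class name, buffer size and
test string are illustrative only, not taken from the TDB code):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;

    public class ShortEncodeDemo {
        public static void main(String[] args) {
            // Lone high surrogate -- not a legal codepoint on its own.
            String s = "Hello \uDAE0 World";
            CharsetEncoder enc = Charset.forName("UTF-8").newEncoder();
            ByteBuffer bb = ByteBuffer.allocate(64);
            // The default malformed-input action is REPORT; if the CoderResult
            // is never checked, encoding silently stops at the bad char.
            enc.encode(CharBuffer.wrap(s), bb, true);
            bb.flip();
            // Prints 6 (just "Hello ") rather than the bytes for the whole string.
            System.out.println("bytes encoded: " + bb.remaining());
        }
    }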
The problem is in the encoding of chars to bytes. The java.nio.charset encoder
needs "onMalformedInput(CodingErrorAction.REPLACE)" to be set, and it isn't.
With that set, the bad Unicode codepoint (a high surrogate without a following low
surrogate to make a surrogate pair) is replaced with a '?' character, which is the
standard Java charset behaviour.
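A sketch of the suggested setting (exactly where the encoder is created in the TDB
code path is not shown here):

    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CodingErrorAction;

    // Replace malformed input (e.g. a lone surrogate) with '?' instead of
    // letting the encode stop short.
    CharsetEncoder enc = Charset.forName("UTF-8").newEncoder()
        .onMalformedInput(CodingErrorAction.REPLACE);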
A more ambitious fix is to not use the Java encoders/decoders, which are sensitive
to codepoint legality, and drop down to custom code that applies only the UTF-8
encoding rules without checking for legal codepoints. This would make TDB
robust, though something else may break when the data leaves the JVM and is
read in elsewhere, because the data is not legal Unicode.
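As a rough illustration of that approach (a hypothetical sketch, not the actual
BlockUTF8/ARQ code): emit the UTF-8 bit patterns directly from each 16-bit char,
with no surrogate pairing or legality checks, so a lone surrogate round-trips even
though the resulting bytes are not strictly legal UTF-8.

    // Hypothetical sketch only -- not the TDB/ARQ implementation.
    static void encodeChar(char c, java.nio.ByteBuffer out) {
        if (c < 0x80) {                        // 1 byte: 0xxxxxxx
            out.put((byte) c);
        } else if (c < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
            out.put((byte) (0xC0 | (c >> 6)));
            out.put((byte) (0x80 | (c & 0x3F)));
        } else {                               // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.put((byte) (0xE0 | (c >> 12)));
            out.put((byte) (0x80 | ((c >> 6) & 0x3F)));
            out.put((byte) (0x80 | (c & 0x3F)));
        }
    }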
Class BlockUTF8 is the code that does String<->ByteBuffer conversion. Classes
InStreamUTF8 and OutStreamUTF8 in ARQ are the UTF-8 algorithm over input and
output streams. The latter are slightly slower (a few percent) than the standard
Java encoders when used in RIOT on large files needing multiple seconds of
decoding time.
Differences in speed will only show up in TDB on very large literals (100k+ ?).
Normally, lexical forms are less than a few hundred bytes and the difference is not
measurable (the custom codec may even be faster due to lower startup
costs). It is well below the rest of the database processing costs.
> TDB datasets can be corrupted by performing certain operations within a transaction
> ------------------------------------------------------------------------------------
>
> Key: JENA-225
> URL: https://issues.apache.org/jira/browse/JENA-225
> Project: Apache Jena
> Issue Type: Bug
> Affects Versions: TDB 0.9.0
> Environment: jena-tdb-0.9.0-incubating
> Reporter: Sam Tunnicliffe
> Attachments: ReportBadUnicode1.java
>
>
> In a web application, we read some triples in an HTTP POST, using a LangTurtle
> instance and a tokenizer obtained from TokenizerFactory.makeTokenizerUTF8.
> We then write the parsed Triples back out (to temporary storage) using
> OutputLangUtils.write. At some later time, these Triples are then re-read,
> again using a Tokenizer from TokenizerFactory.makeTokenizerUTF8, before being
> inserted into a TDB dataset.
> We have found it possible for the input data to contain character strings
> which pass through the various parsers/serializers but which cause TDB's
> transaction layer to error in such a way as to make recovery from journals
> ineffective.
> Eliminating transactions from the code path enables the database to be
> updated successfully.
> The stacktrace from TDB looks like this:
> org.openjena.riot.RiotParseException: [line: 1, col: 2 ] Broken token: Hello
> at org.openjena.riot.tokens.TokenizerText.exception(TokenizerText.java:1209)
> at org.openjena.riot.tokens.TokenizerText.readString(TokenizerText.java:620)
> at org.openjena.riot.tokens.TokenizerText.parseToken(TokenizerText.java:248)
> at org.openjena.riot.tokens.TokenizerText.hasNext(TokenizerText.java:112)
> at com.hp.hpl.jena.tdb.nodetable.NodecSSE.decode(NodecSSE.java:105)
> at com.hp.hpl.jena.tdb.lib.NodeLib.decode(NodeLib.java:93)
> at com.hp.hpl.jena.tdb.nodetable.NodeTableNative$2.convert(NodeTableNative.java:234)
> at com.hp.hpl.jena.tdb.nodetable.NodeTableNative$2.convert(NodeTableNative.java:228)
> at org.openjena.atlas.iterator.Iter$4.next(Iter.java:301)
> at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.append(NodeTableTrans.java:188)
> at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.writeNodeJournal(NodeTableTrans.java:306)
> at com.hp.hpl.jena.tdb.transaction.NodeTableTrans.commitPrepare(NodeTableTrans.java:266)
> at com.hp.hpl.jena.tdb.transaction.Transaction.prepare(Transaction.java:131)
> at com.hp.hpl.jena.tdb.transaction.Transaction.commit(Transaction.java:112)
> at com.hp.hpl.jena.tdb.transaction.DatasetGraphTxn.commit(DatasetGraphTxn.java:40)
> at com.hp.hpl.jena.tdb.transaction.DatasetGraphTransaction._commit(DatasetGraphTransaction.java:106)
> at com.hp.hpl.jena.tdb.migrate.DatasetGraphTrackActive.commit(DatasetGraphTrackActive.java:60)
> at com.hp.hpl.jena.sparql.core.DatasetImpl.commit(DatasetImpl.java:143)
> At least part of the issue seems to stem from NodecSSE (I know this isn't
> actual Unicode escaping, but it's derived from the user input we've received).
> String s = "Hello \uDAE0 World";
> Node literal = Node.createLiteral(s);
> ByteBuffer bb = NodeLib.encode(literal);
> NodeLib.decode(bb);