In the SO report, the data contains surrogate pairs encoded as 3-byte sequences:
https://unicode.org/faq/utf_bom.html#utf8-4
This is illegal in UTF-8, and Java (as of Java 8) enforces this.
Jena was not handling this as gracefully as it might (it can't fix the
problem without replacing JDK charset decoding).
JENA-2118 changes the handling to put in the Unicode replacement character
(U+FFFD) instead, but that means the data is changed.
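
To make the point about 3-byte encoded surrogates concrete, here is a small sketch in Python (not Jena code, and not how Jena or the JDK implement it). It assumes U+1F600 as an arbitrary example character and uses Python's "surrogatepass" error handler only to manufacture the illegal bytes. A strict decoder rejects them; a replacing decoder lets the data through but changes it. How many U+FFFD characters come out per bad sequence depends on the decoder, so the sketch does not rely on a count.

```python
# Illegal form: each UTF-16 surrogate of U+1F600 encoded as its own 3-byte sequence.
# "surrogatepass" is used here only to build the bad bytes for the demonstration.
bad = "\ud83d\ude00".encode("utf-8", "surrogatepass")

# Legal form: the same character as a single 4-byte UTF-8 sequence.
good = "\U0001F600".encode("utf-8")


def decodes_strictly(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 under a strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Replacement keeps decoding going, but the original character is lost.
replaced = bad.decode("utf-8", errors="replace")

print(bad.hex(" "), decodes_strictly(bad))
print(good.hex(" "), decodes_strictly(good))
print([hex(ord(ch)) for ch in replaced])
```

The last line prints only U+FFFD code points, which is the "data is changed" trade-off mentioned above.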
Jerven - this might or might not be what your data has in it. Be good to
know if it is.
Andy
On 14/06/2021 20:25, Andy Seaborne wrote:
On 14/06/2021 18:56, Andy Seaborne wrote:
PS Try Jena 4.1.0.
PPS Spoke too soon. 4.1.0 does not handle surrogates unless they are presented as
explicit Unicode escapes: \u or \U.
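
For reference, the arithmetic relating a single \U escape to a \u surrogate pair can be sketched as below. This is plain Python illustrating the UTF-16 surrogate calculation only; the function names are made up for the example, U+1F600 is an arbitrary code point, and nothing here says which escape forms a given Jena version accepts.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points have a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def as_escapes(code_point: int) -> tuple[str, str]:
    """Return the code point written as one \\U escape and as a \\u surrogate pair."""
    high, low = surrogate_pair(code_point)
    return f"\\U{code_point:08X}", f"\\u{high:04X}\\u{low:04X}"


print(as_escapes(0x1F600))
```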
Andy
There were changes relating to:
* Handling of Unicode surrogate pairs
* not being sensitive to the Unicode version supported by the Java
runtime platform.
Depending on how you use the data, it may output as one or two U+FFFD
("unmappable") characters but that's an output issue.
Andy
On 14/06/2021 15:50, Andy Seaborne wrote:
Hi Jerven,
This is java.nio.charset.MalformedInputException wrapped up.
Unfortunately, the line/col aren't right because the standard Java
decoder does not provide the information. It will be after the
reported point in the input so slicing the front off should get you a
smaller file to look at, then binary chop to find a slice containing
the error.
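
As an alternative to slicing and binary chopping by hand, a short script outside Jena can scan the file for the first bytes that are not valid UTF-8. The following is a minimal sketch in Python, assuming the file can be read line by line in binary mode (safe for UTF-8, because the newline byte 0x0A never occurs inside a multi-byte sequence). The function name is hypothetical, and the columns it reports are byte offsets within the line, not character columns.

```python
from typing import Iterable, Iterator, Tuple


def find_bad_utf8_lines(lines: Iterable[bytes]) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (line number, byte offset in line, offending bytes) for each bad line.

    Only the first decoding error on each line is reported.
    """
    for lineno, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")  # strict decoding
        except UnicodeDecodeError as err:
            yield lineno, err.start, raw[err.start:err.end]


# Small in-memory demonstration; the second line contains a 3-byte encoded surrogate.
sample = [b"SLM:rank SLM:Isomeric_Subspecies ;\n", b'ab\xed\xa0\xbd" ;\n']
for lineno, offset, bad_bytes in find_bad_utf8_lines(sample):
    print(lineno, offset, bad_bytes.hex(" "))
```

For a real file, the same function can be given an open binary file object, for example `find_bad_utf8_lines(open("swisslipids.ttl", "rb"))`, since iterating a binary file yields one `bytes` line at a time.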
For efficiency reasons (and it does make a notable difference), RIOT
grabs large chunks of characters from the UTF-8 => java characters
decoder (128K chunks). Down side - encoding errors are reported as
"somewhere" and can be anywhere in the chunk.
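
The effect of chunked decoding on error reporting can be illustrated with a small Python sketch. This is not RIOT's code; it only mimics the idea of feeding fixed-size chunks to a strict decoder, with a deliberately tiny chunk size so the example is easy to follow. When a chunk fails, all that is known is the byte range of that chunk.

```python
import codecs
from typing import Optional, Tuple


def chunk_with_error(data: bytes, chunk_size: int) -> Optional[Tuple[int, int]]:
    """Return the (start, end) byte range of the first chunk that fails to decode.

    Returns None if the input is well-formed UTF-8. A bad sequence that straddles
    a chunk boundary is reported against the later chunk, so it may begin a few
    bytes before the reported start.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()  # strict error handling
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        final = start + chunk_size >= len(data)
        try:
            decoder.decode(chunk, final)
        except UnicodeDecodeError:
            return start, start + len(chunk)
    return None


# The bad bytes sit at offsets 10..12, but only the enclosing chunk is reported.
data = b"a" * 10 + b"\xed\xa0\xbd" + b"b" * 10
print(chunk_with_error(data, 8))
```

With a 128K chunk the same reasoning applies, just with a much wider range to search.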
There is a Jena tool "utf8" which more carefully translates UTF-8 -
it may help pinpoint the error. It's better in 4.1.0: in 4.0.0 it got
left in verbose mode - it needs a small slice of data and it may not
find the error helpfully because of the different ways it can fail.
I'd be interested in knowing whether mapping to "unmappable" U+FFFD
would help - but it's a silent translation so not a perfect solution.
Andy
Is this connected with a Q today:
https://stackoverflow.com/questions/67970538/is-it-possible-to-ignore-riotparseexception-in-apache-jena
On 14/06/2021 14:59, jerven Bolleman wrote:
Dear Jena team,
I have a turtle file that fails validation with the following error.
riot --validate swisslipids.ttl
15:19:08 ERROR riot :: [line: 5794892, col: 6 ] Bad character encoding
But I can't spot the error on that line so I did a hexdump.
sed -n '5794892p' swisslipids.ttl | hexdump -C
00000000  20 20 53 4c 4d 3a 72 61  6e 6b 20 53 4c 4d 3a 49  |  SLM:rank SLM:I|
00000010  73 6f 6d 65 72 69 63 5f  53 75 62 73 70 65 63 69  |someric_Subspeci|
00000020  65 73 20 3b 0a                                    |es ;.|
00000025
This is identical to an earlier line, which passed:
sed -n '5794877p' swisslipids.ttl | hexdump -C
00000000  20 20 53 4c 4d 3a 72 61  6e 6b 20 53 4c 4d 3a 49  |  SLM:rank SLM:I|
00000010  73 6f 6d 65 72 69 63 5f  53 75 62 73 70 65 63 69  |someric_Subspeci|
00000020  65 73 20 3b 0a                                    |es ;.|
00000025
The file is unfortunately too large to attach, at 29MB of xz-compressed
data. I would be more than happy to share it or a subset.
Riot version is 4.0.0
Regards,
Jerven
PS nearby lines with their line numbers
5794876 SLM:000501095 a owl:Class ;
5794877 SLM:rank SLM:Isomeric_Subspecies ;
5794878 rdfs:label "(12S)-hydroperoxy-(5Z,8Z,10E,14Z,17Z)-eicosapentaenoate" ;
5794879 skos:altLabel "(12S)-Hp-(5Z,8Z,10E,14Z,17Z)-EPE" ;
5794880 rdfs:subClassOf SLM:000501324 ;
5794881 chebislash:inchi "InChI=1S/C20H30O4/c1-2-3-4-5-10-13-16-19(24-23)17-14-11-8-6-7-9-12-15-18-20(21)22/h3-4,7-11,13-14,17,19,23H,2,5-6,12,15-16,18H2,1H3,(H,21,22)/p-1/b4-3-,9-7-,11-8-,13-10-,17-14+/t19-/m0/s1" ;
5794882 chebislash:inchikey "HDMYXONNVAOHFR-UOLHMMFFSA-M" ;
5794883 owl:equivalentClass
5794884 CHEBI:90772
5794885 ;
5794886 rdfs:seeAlso lipidmaps:LMFA03070012 ;
5794887 chebislash:charge "-1" ;
5794888 chebislash:smiles '''C(=C\\C\\C=C/C=C/[C@H](C/C=C\\C/C=C\\CC)OO)\\CCCC([O-])=O''' ;
5794889 SLM:citation citation:22984144 ;
5794890 chebislash:formula "C20H29O4" .
5794891 SLM:000501145 a owl:Class ;
5794892 SLM:rank SLM:Isomeric_Subspecies ;