Re: Jena riot --validate : I can't spot the error

jerven Bolleman Wed, 16 Jun 2021 07:37:45 -0700

Hi Andy,

I found the bad line eventually with

grep -a -v -n -P "^[A-Za-z0-9\"$\:\,\|$\;\.\-\=\+\/\'\[\]\\\\_\?\@\<\>\#\*]+$" swisslipids.ttl


5796196:  rdfs:label "�-D-galactosyl-(11')-sphing-4-enine" ;

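That was quite a distance away from the reported line (5794892).

The � is a badly written ß
https://en.wikipedia.org/wiki/%C3%9F
in the upstream data set: a single byte 0xdf, which is ß in ISO-8859-1, but 0xdf followed by 0x2d is not a valid UTF-8 sequence.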

head -n 5796196 swisslipids.ttl| tail -n 1 | hexdump -C
                r  d  f  s  :  l   a  b  e  l     "  <bad>
00000000  20 20 72 64 66 73 3a 6c  61 62 65 6c 20 22 df 2d  |  rdfs:label ".-|
00000010  44 2d 67 61 6c 61 63 74  6f 73 79 6c 2d 28 31 31  |D-galactosyl-(11|
00000020  27 29 2d 73 70 68 69 6e  67 2d 34 2d 65 6e 69 6e  |')-sphing-4-enin|
00000030  65 22 20 3b 20 0a                                 |e" ; .|
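
The same search can also be done by testing each line for UTF-8 validity instead of matching an allow-list of characters. What follows is a minimal Python sketch, assuming the file is meant to be UTF-8 throughout. The names find_bad_utf8_lines and scan_file are made up for illustration and are not part of Jena.

```python
from typing import Iterable, Iterator, Tuple


def find_bad_utf8_lines(lines: Iterable[bytes]) -> Iterator[Tuple[int, int, int]]:
    """Yield (line number, 1-based byte column, offending byte) per invalid line.

    Hypothetical helper: only the first decoding error on each line is reported.
    """
    for line_number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")  # strict decoding raises on malformed input
        except UnicodeDecodeError as err:
            yield line_number, err.start + 1, raw[err.start]


def scan_file(path: str) -> None:
    """Print one report line per invalid line of the file at `path`."""
    # Iterating a file opened in binary mode yields one bytes object per line.
    with open(path, "rb") as handle:
        for line_number, column, byte in find_bad_utf8_lines(handle):
            print(f"{line_number}: byte column {column}: 0x{byte:02x}")
```

Calling scan_file("swisslipids.ttl") would report line numbers in the same numbering that grep -n and sed -n use, since all of them count newline-terminated lines.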

The upstream code has some logic to replace some Unicode characters with "ascii", expecting beta etc., but not a German double s (ß).
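
To make the failure concrete, here is a small Python sketch of what the byte in the hexdump means under two encodings. It assumes the upstream text was written as ISO-8859-1, which is a guess about the upstream system and not something confirmed in this thread. The helper name is_valid_utf8 is illustrative only.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True when `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Bytes as they appear after the opening quote in the hexdump: df 2d 44 ...
raw = b"\xdf-D-galactosyl"

# Assumption: the producer meant ISO-8859-1, where 0xdf is the letter sharp s.
as_text = raw.decode("iso-8859-1")

# Re-encoding that text as UTF-8 turns the single byte into the pair c3 9f,
# which is the percent-encoded form seen in the Wikipedia URL.
as_utf8 = as_text.encode("utf-8")
```

Whether re-encoding is the right repair depends on what the upstream label was meant to say, so this only shows why the byte is rejected, not how the data set should be fixed.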

Thanks again for having such a nice set of strict tools. It really helps us data publishers not to put out "crap" by accident.


Regards,
Jerven


On 15/06/2021 11:52, Andy Seaborne wrote:

In the SO report, the data has surrogate pairs encoded as two 3-byte sequences:

https://unicode.org/faq/utf_bom.html#utf8-4

This is illegal in UTF-8. Java (as of Java 8) enforces this.
Jena was not handling this as gracefully as it might (it can't fix the problem without replacing JDK charset decoding).
JENA-2118 changes the handling to put in the Unicode replacement character, but that means the data is changed.
Jerven - this might or might not be what your data has in it. It would be good to know if it is.
     Andy
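
As an illustration of the point about 3-byte encoded surrogates, the Python sketch below shows a strict decoder rejecting such input, a lossy decode that substitutes U+FFFD, and one possible repair. It uses Python's codecs, not the JDK decoder or Jena's code, and the sample bytes are an arbitrary example (U+1F600 written as two 3-byte surrogates), not data from this thread. The function names are made up.

```python
def strict_utf8_ok(data: bytes) -> bool:
    """Return True when `data` is accepted by a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def repair_surrogate_pairs(data: bytes) -> str:
    """Decode input that contains 3-byte encoded surrogates and join the pairs.

    Hypothetical repair: 'surrogatepass' lets the surrogate halves through,
    and a round trip via UTF-16 combines each high/low pair into one code point.
    """
    halves = data.decode("utf-8", errors="surrogatepass")
    return halves.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")


# U+D83D and U+DE00, each written as its own 3-byte sequence.
sample = b"\xed\xa0\xbd\xed\xb8\x80"

# Lossy alternative: the text decodes, but the original character is gone.
lossy = sample.decode("utf-8", errors="replace")
```

The lossy variant corresponds to the trade-off described above: parsing continues, but the data is changed.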


On 14/06/2021 20:25, Andy Seaborne wrote:
On 14/06/2021 18:56, Andy Seaborne wrote:
PS Try Jena 4.1.0.
PPS Spoke too soon. 4.1.0 does not handle surrogates unless they are presented as explicit Unicode escapes: \u or \U .
     Andy
There were changes relating to:

* Handling of Unicode surrogate pairs
* not being sensitive to the Unicode version supported by the Java runtime platform.
Depending on how you use the data, it may output as one or two U+FFFD ("unmappable") characters but that's an output issue.
     Andy
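
For reference, a surrogate pair written as two \u escapes stands for a single code point, and the arithmetic is fixed by UTF-16. The sketch below is Python with a made-up function name, and the pair in the usage note is an arbitrary example, not taken from the data in this thread.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high and low surrogate into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: U+{high:04X}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: U+{low:04X}")
    # Each half carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, combine_surrogates(0xD83D, 0xDE00) gives 0x1F600, the code point that the escape pair \uD83D\uDE00 denotes.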

On 14/06/2021 15:50, Andy Seaborne wrote:
Hi Jerven,
This is java.nio.charset.MalformedInputException wrapped up. Unfortunately, the line/col aren't right because the standard Java decoder does not provide the information. The error will be after the reported point in the input, so slicing the front off should get you a smaller file to look at; then binary chop to find a slice containing the error.
For efficiency reasons (and it does make a notable difference), RIOT grabs large chunks of characters from the UTF-8 => Java characters decoder (128K chunks). Downside: encoding errors are reported as "somewhere" and can be anywhere in the chunk.
There is a Jena tool "utf8" which more carefully translates UTF-8 - it may help pinpoint the error. It's better in 4.1.0: in 4.0.0 it got left in verbose mode. It needs a small slice of data, and it may not find the error helpfully because of the different ways it can fail.
I'd be interested in knowing whether mapping to "unmappable" U+FFFD would help - but it's a silent translation so not a perfect solution.
     Andy
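
The binary chop suggested above can be sketched as follows. This is an illustration in Python rather than Java, and it is not how RIOT or the Jena utf8 tool work. It treats the decoder as a yes/no oracle, much like rerunning a validator on ever smaller slices, and both function names are made up.

```python
import codecs
from typing import Optional


def decodes_cleanly(prefix: bytes) -> bool:
    """Oracle: does this prefix decode as UTF-8 without a malformed-input error?"""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the end.
        decoder.decode(prefix, final=False)
        return True
    except UnicodeDecodeError:
        return False


def shortest_failing_prefix(data: bytes) -> Optional[int]:
    """Binary chop for the smallest n such that data[:n] fails to decode.

    Returns None when no malformed sequence is found. The error is detected
    at byte n - 1; the malformed sequence starts there or a few bytes earlier.
    """
    if decodes_cleanly(data):
        return None
    low, high = 0, len(data)  # invariant: data[:low] is fine, data[:high] fails
    while high - low > 1:
        middle = (low + high) // 2
        if decodes_cleanly(data[:middle]):
            low = middle
        else:
            high = middle
    return high
```

The chop works because the oracle is monotone: once a prefix fails, every longer prefix fails too. A decoder that reports the failing offset directly makes the search unnecessary, which is the information the standard Java decoder is said not to provide.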

Is this connected with a Q today:
https://stackoverflow.com/questions/67970538/is-it-possible-to-ignore-riotparseexception-in-apache-jena
On 14/06/2021 14:59, jerven Bolleman wrote:
Dear Jena team,

I have a turtle file that fails validation with the following error.

riot --validate swisslipids.ttl
15:19:08 ERROR riot :: [line: 5794892, col: 6 ] Bad character encoding
But I can't spot the error on that line so I did a hexdump.


sed -n '5794892p' swisslipids.ttl | hexdump -C
00000000  20 20 53 4c 4d 3a 72 61  6e 6b 20 53 4c 4d 3a 49  |  SLM:rank SLM:I|
00000010  73 6f 6d 65 72 69 63 5f  53 75 62 73 70 65 63 69  |someric_Subspeci|
00000020  65 73 20 3b 0a                                    |es ;.|
00000025

Which is the same as a different earlier line which passed

sed -n '5794877p' swisslipids.ttl | hexdump -C
00000000  20 20 53 4c 4d 3a 72 61  6e 6b 20 53 4c 4d 3a 49  |  SLM:rank SLM:I|
00000010  73 6f 6d 65 72 69 63 5f  53 75 62 73 70 65 63 69  |someric_Subspeci|
00000020  65 73 20 3b 0a                                    |es ;.|
00000025
The file is unfortunately too large to attach at 29MB of xz compressed data. I would be more than happy to share it or a subset.
Riot version is 4.0.0

Regards,
Jerven

PS nearby lines with their line numbers

5794876 SLM:000501095 a owl:Class ;
5794877   SLM:rank SLM:Isomeric_Subspecies ;
5794878   rdfs:label "(12S)-hydroperoxy-(5Z,8Z,10E,14Z,17Z)-eicosapentaenoate" ;
5794879   skos:altLabel "(12S)-Hp-(5Z,8Z,10E,14Z,17Z)-EPE" ;
5794880   rdfs:subClassOf SLM:000501324 ;
5794881   chebislash:inchi "InChI=1S/C20H30O4/c1-2-3-4-5-10-13-16-19(24-23)17-14-11-8-6-7-9-12-15-18-20(21)22/h3-4,7-11,13-14,17,19,23H,2,5-6,12,15-16,18H2,1H3,(H,21,22)/p-1/b4-3-,9-7-,11-8-,13-10-,17-14+/t19-/m0/s1";
5794882   chebislash:inchikey "HDMYXONNVAOHFR-UOLHMMFFSA-M" ;
5794883   owl:equivalentClass
5794884  CHEBI:90772
5794885  ;
5794886   rdfs:seeAlso lipidmaps:LMFA03070012 ;
5794887   chebislash:charge "-1" ;
5794888   chebislash:smiles '''C(=C\\C\\C=C/C=C/[C@H](C/C=C\\C/C=C\\CC)OO)\\CCCC([O-])=O''' ;
5794889   SLM:citation citation:22984144 ;
5794890   chebislash:formula "C20H29O4" .
5794891 SLM:000501145 a owl:Class ;
5794892   SLM:rank SLM:Isomeric_Subspecies ;


--

        *Jerven Tjalling Bolleman*
Principal Software Developer
*SIB | Swiss Institute of Bioinformatics*
1, rue Michel Servet - CH 1211 Geneva 4 - Switzerland
t +41 22 379 58 85
[email protected] - www.sib.swiss
