Hi Andrea, Rupert committed a change: http://svn.apache.org/r1459296 (Wow! Fast response!)
Andrea: maybe you could try the SVN version and tell us if it works?

Cheers,
JC

On 21 March 2013 12:08, Jona Christopher Sahnwaldt <j...@sahnwaldt.de> wrote:
> Hi Andrea, Rupert,
>
> Rupert, maybe you can help. Summary: DBpedia backslash escaping is
> (most likely) correct since 3.8. Stanbol / Jena can read the DBpedia
> 3.8 files fine if they are uncompressed first. It looks like Stanbol
> has a problem with bz2.
>
> https://issues.apache.org/jira/browse/STANBOL-804
> http://markmail.org/message/67ivlyoxfqad6xoe
>
> Cheers,
> JC
>
> On 21 March 2013 10:20, Andrea Di Menna <ninn...@gmail.com> wrote:
>> Hi Jona,
>>
>> I compressed the nt file with bzip2
>>
>> andread@build04:~/tools/apache-jena-2.7.4/bin$ bzip2 --version
>> bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.
>>
>> Copyright (C) 1996-2010 by Julian Seward.
>>
>> This program is free software; you can redistribute it and/or modify
>> it under the terms set out in the LICENSE file, which is included
>> in the bzip2-1.0.6 source distribution.
>>
>> This program is distributed in the hope that it will be useful,
>> but WITHOUT ANY WARRANTY; without even the implied warranty of
>> MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>> LICENSE file for more details.
>>
>> Also, I now tried with the same file mentioned in the JIRA bug [1], using
>> both Jena 2.7.4 and 2.10.0 tdbloader2, and got the following:
>>
>> 1) Same exception as below when running on bz2 file
>> 2) No exception with uncompressed nt file
>>
>> But I remember seeing the same exceptions as the ones in the JIRA issue when
>> using Stanbol indexing tool (which is building a TDB from source RDF files,
>> before building the Solr index).
>> It is likely then that the Stanbol code is not acting as the tdbloader2 when
>> processing RDF files.
>>
>> WDYT?
>>
>> Cheers
>> Andrea
>>
>> [1] http://downloads.dbpedia.org/3.8/es/redirects_es.nt.bz2
>>
>>
>> 2013/3/21 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>>
>>> On 20 March 2013 20:10, Andrea Di Menna <ninn...@gmail.com> wrote:
>>> > Hi Jona,
>>> >
>>> > I have tried loading labels_en_uris_de.nt.bz2 from the DBpedia 3.8 release
>>> > using both Jena 2.7.4 and 2.10.0, but both fail with the following error:
>>> >
>>> > andread@build04:~/tools/apache-jena-2.10.0/bin$ ./tdbloader2 --loc . /media/HD2/data/dbpedia-3.8-archive/source_data/labels_en_uris_de.nt.bz2
>>> > 19:48:02 -- TDB Bulk Loader Start
>>> > 19:48:02 Data phase
>>> > INFO Load: /media/HD2/data/dbpedia-3.8-archive/source_data/labels_en_uris_de.nt.bz2 -- 2013/03/20 19:48:03 CET
>>> > Exception in thread "main" org.apache.jena.atlas.AtlasException: java.nio.charset.MalformedInputException: Input length = 1
>>> >     at org.apache.jena.atlas.io.IO.exception(IO.java:154)
>>> >     at org.apache.jena.atlas.io.CharStreamBuffered$SourceReader.fill(CharStreamBuffered.java:79)
>>> >     at org.apache.jena.atlas.io.CharStreamBuffered.fillArray(CharStreamBuffered.java:156)
>>> >     at org.apache.jena.atlas.io.CharStreamBuffered.advance(CharStreamBuffered.java:139)
>>> >     at org.apache.jena.atlas.io.PeekReader.advanceAndSet(PeekReader.java:251)
>>> >     at org.apache.jena.atlas.io.PeekReader.init(PeekReader.java:244)
>>> >     at org.apache.jena.atlas.io.PeekReader.peekChar(PeekReader.java:169)
>>> >     at org.apache.jena.atlas.io.PeekReader.makeUTF8(PeekReader.java:108)
>>> >     at org.apache.jena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:41)
>>> >     at org.apache.jena.riot.RiotReader.createParser(RiotReader.java:130)
>>> >     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:115)
>>> >     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:93)
>>> >     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:66)
>>> >     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:162)
>>> >     at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
>>> >     at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
>>> >     at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
>>> >     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:80)
>>> > Caused by: java.nio.charset.MalformedInputException: Input length = 1
>>> >     at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
>>> >     at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
>>> >     at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
>>> >     at java.io.InputStreamReader.read(InputStreamReader.java:184)
>>> >     at java.io.Reader.read(Reader.java:140)
>>> >     ... 17 more
>>> >
>>> > Anyway, I have now tried the following:
>>> >
>>> > 1) Download german labels
>>> > 2) Run tdbloader2 on the bz2 nt file -> failure
>>> > 3) Uncompress the bz2 file and run tdbloader2 -> SUCCESS
>>> > 4) Compress the nt file again -> failure
>>> >
>>> > Looks like Jena is having some problems with bz2 files then.
>>>
>>> Interesting.
>>>
>>> Since 3.8, we use parallel bzip2 [1] to compress the files (it's much
>>> faster on multi-core machines). The files created by pbzip2 have a
>>> slightly different format though. Legal for bzip2, but for example
>>> older versions of Commons Compress cannot deal with it [2][3].
>>>
>>> > 2) Run tdbloader2 on the bz2 nt file -> failure
>>> > 3) Uncompress the bz2 file and run tdbloader2 -> SUCCESS
>>>
>>> This very much looks like compression is the culprit, not DBpedia
>>> encoding.
>>>
>>> > 4) Compress the nt file again -> failure
>>>
>>> This is a bit weird. How do you compress the file?
>>>
>>>
>>> Cheers,
>>> JC
>>>
>>> [1] http://compression.ca/pbzip2/
>>> [2] https://issues.apache.org/jira/browse/COMPRESS-146
>>> [3] https://issues.apache.org/jira/browse/COMPRESS-162
>>>
>>> > Would you mind giving it a try?
>>> >
>>> > But anyway please check this JIRA issue out
>>> > https://issues.apache.org/jira/browse/STANBOL-804
>>> >
>>> > Cheers
>>> > Andrea
>>> >
>>> >
>>> > 2013/3/20 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>> >>
>>> >> Hi Andrea,
>>> >>
>>> >> there used to be encoding problems, but I think they are all fixed
>>> >> since the 3.8 release. I tried very hard to make TurtleEscaper do the
>>> >> right thing - I checked the relevant standards etc. Could you give an
>>> >> example where Jena complains about a DBpedia 3.8 file?
>>> >>
>>> >> Cheers,
>>> >> JC
>>> >>
>>> >> On Wed, Mar 20, 2013 at 6:16 PM, Andrea Di Menna <ninn...@gmail.com> wrote:
>>> >> > Hi,
>>> >> >
>>> >> > I have been using Stanbol [1] to process DBpedia data files and build a
>>> >> > dbpedia Solr index.
>>> >> > Stanbol is using Jena TDB in order to load DBpedia files into a triple
>>> >> > store.
>>> >> > Unfortunately, almost all the DBpedia N-Triples files must be pre-processed
>>> >> > before being able to import them using Jena [2].
>>> >> >
>>> >> > The following sed command is launched:
>>> >> >
>>> >> > sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g'
>>> >> >
>>> >> > Basically the backslash is replaced with the unicode character escape
>>> >> > sequence.
>>> >> >
>>> >> > Do you think this should/could be fixed in
>>> >> > org.dbpedia.extraction.util.TurtleEscaper#escapeTurtle ?
>>> >> >
>>> >> > Cheers
>>> >> > Andrea
>>> >> >
>>> >> > [1] http://stanbol.apache.org/
>>> >> > [2] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh
>>> >> >
>>> >> >
>>> >> >
>>> >> > ------------------------------------------------------------------------------
>>> >> > Everyone hates slow websites. So do we.
>>> >> > Make your web apps faster with AppDynamics
>>> >> > Download AppDynamics Lite for free today:
>>> >> > http://p.sf.net/sfu/appdyn_d2d_mar
>>> >> > _______________________________________________
>>> >> > Dbpedia-discussion mailing list
>>> >> > Dbpedia-discussion@lists.sourceforge.net
>>> >> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>> >> >
>>> >
>>> >
>>
>>
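
A small illustration of JC's remark that pbzip2 output is "legal for bzip2" but trips up some readers: pbzip2 compresses the input in independent chunks and writes them as several concatenated bz2 streams. The sketch below is an assumption-laden toy in Python, not Stanbol, Jena or Commons Compress code. The helper names (make_multistream, read_first_stream_only, read_all_streams) and the tiny chunk size are made up for the demonstration. It shows that a reader which stops after the first stream returns truncated data, and that the cut can land inside a multi-byte UTF-8 character. It does not explain step 4 in Andrea's list, where a file recompressed with plain bzip2 also failed.

```python
import bz2


def make_multistream(text: str, chunk_size: int) -> bytes:
    """Compress UTF-8 bytes as several concatenated bz2 streams.

    This imitates the layout attributed to pbzip2 in the thread.
    The chunk size here is tiny on purpose; it is a demo value only.
    """
    raw = text.encode("utf-8")
    return b"".join(
        bz2.compress(raw[i:i + chunk_size])
        for i in range(0, len(raw), chunk_size)
    )


def read_first_stream_only(data: bytes) -> bytes:
    """Mimic a reader that stops at the end of the first bz2 stream."""
    # BZ2Decompressor handles exactly one stream; the rest of the
    # input ends up in its unused_data attribute and is ignored here.
    return bz2.BZ2Decompressor().decompress(data)


def read_all_streams(data: bytes) -> bytes:
    """Decompress every concatenated stream, one decompressor per stream."""
    parts = []
    while data:
        decompressor = bz2.BZ2Decompressor()
        parts.append(decompressor.decompress(data))
        data = decompressor.unused_data  # bytes belonging to later streams
    return b"".join(parts)
```

With a sample such as "Zürich" and chunk_size=2, the first stream ends after the first byte of "ü", so decoding the output of read_first_stream_only as UTF-8 raises UnicodeDecodeError, while read_all_streams (or bz2.decompress, which accepts concatenated streams since Python 3.3) returns the full text.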