I suspect multistream bzip2 is the culprit, which would fit with the files having been produced by parallel bzip2.
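To make "multistream" concrete: as far as I understand it, pbzip2 compresses chunks independently and writes the resulting bzip2 streams back to back, so a reader that stops at the first end-of-stream marker only sees the first chunk. Here is a minimal Python sketch of that behaviour. It assumes Python 3.3 or later; the sample triples and the helper name decompress_all are made up for illustration and are not taken from DBpedia, Jena or pbzip2.

```python
import bz2

# Two independent bzip2 streams written back to back. This mimics the
# layout of a multistream file; the triples are invented sample data.
part_one = bz2.compress(b'<s1> <p> "eins"@de .\n')
part_two = bz2.compress(b'<s2> <p> "zwei"@de .\n')
multistream = part_one + part_two

# A single-stream reader: one decompressor object handles exactly one
# stream and leaves whatever follows it in unused_data.
single = bz2.BZ2Decompressor()
first_only = single.decompress(multistream)
leftover = single.unused_data  # raw bytes of the second stream


def decompress_all(data):
    """Decode every concatenated stream by restarting the decompressor
    (hypothetical helper, shown only to illustrate the idea)."""
    chunks = []
    while data:
        decomp = bz2.BZ2Decompressor()
        chunks.append(decomp.decompress(data))
        if not decomp.eof:
            raise ValueError("truncated bzip2 stream")
        data = decomp.unused_data
    return b"".join(chunks)


print(first_only)                   # only the first triple
print(decompress_all(multistream))  # both triples
```

A reader without such a restart loop silently drops everything after the first stream. In Python 3.3+ the module-level bz2.decompress() already handles concatenated streams, so the loop is only there to show what happens underneath.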
For what it's worth Python 2.x can't read these files either. There's a backport of the 3.x support, but it requires installing a separate package.

Tom

On Wed, Mar 20, 2013 at 9:48 PM, Jona Christopher Sahnwaldt <j...@sahnwaldt.de> wrote:
> On 20 March 2013 20:10, Andrea Di Menna <ninn...@gmail.com> wrote:
>> Hi Jona,
>>
>> I have tried loading labels_en_uris_de.nt.bz2 from the DBpedia 3.8 release
>> using both Jena 2.7.4 and 2.10.0, but both fail with the following error:
>>
>> andread@build04:~/tools/apache-jena-2.10.0/bin$ ./tdbloader2 --loc . /media/HD2/data/dbpedia-3.8-archive/source_data/labels_en_uris_de.nt.bz2
>> 19:48:02 -- TDB Bulk Loader Start
>> 19:48:02 Data phase
>> INFO Load: /media/HD2/data/dbpedia-3.8-archive/source_data/labels_en_uris_de.nt.bz2 -- 2013/03/20 19:48:03 CET
>> Exception in thread "main" org.apache.jena.atlas.AtlasException: java.nio.charset.MalformedInputException: Input length = 1
>>     at org.apache.jena.atlas.io.IO.exception(IO.java:154)
>>     at org.apache.jena.atlas.io.CharStreamBuffered$SourceReader.fill(CharStreamBuffered.java:79)
>>     at org.apache.jena.atlas.io.CharStreamBuffered.fillArray(CharStreamBuffered.java:156)
>>     at org.apache.jena.atlas.io.CharStreamBuffered.advance(CharStreamBuffered.java:139)
>>     at org.apache.jena.atlas.io.PeekReader.advanceAndSet(PeekReader.java:251)
>>     at org.apache.jena.atlas.io.PeekReader.init(PeekReader.java:244)
>>     at org.apache.jena.atlas.io.PeekReader.peekChar(PeekReader.java:169)
>>     at org.apache.jena.atlas.io.PeekReader.makeUTF8(PeekReader.java:108)
>>     at org.apache.jena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:41)
>>     at org.apache.jena.riot.RiotReader.createParser(RiotReader.java:130)
>>     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:115)
>>     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:93)
>>     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:66)
>>     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:162)
>>     at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
>>     at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
>>     at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
>>     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:80)
>> Caused by: java.nio.charset.MalformedInputException: Input length = 1
>>     at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
>>     at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
>>     at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
>>     at java.io.InputStreamReader.read(InputStreamReader.java:184)
>>     at java.io.Reader.read(Reader.java:140)
>>     ... 17 more
>>
>> Anyway, I have now tried the following:
>>
>> 1) Download german labels
>> 2) Run tdbloader2 on the bz2 nt file -> failure
>> 3) Uncompress the bz2 file and run tdbloader2 -> SUCCESS
>> 4) Compress the nt file again -> failure
>>
>> Looks like Jena is having some problems with bz2 files then.
>
> Interesting.
>
> Since 3.8, we use parallel bzip2 [1] to compress the files (it's much
> faster on multi-core machines). The files created by pbzip2 have a
> slightly different format though. Legal for bzip2, but for example
> older versions of Commons Compress cannot deal with it [2][3].
>
>> 2) Run tdbloader2 on the bz2 nt file -> failure
>> 3) Uncompress the bz2 file and run tdbloader2 -> SUCCESS
>
> This very much looks like compression is the culprit, not DBpedia encoding.
>
>> 4) Compress the nt file again -> failure
>
> This is a bit weird. How do you compress the file?
>
> Cheers,
> JC
>
> [1] http://compression.ca/pbzip2/
> [2] https://issues.apache.org/jira/browse/COMPRESS-146
> [3] https://issues.apache.org/jira/browse/COMPRESS-162
>
>> Would you mind giving it a try?
>>
>> But anyway please check this JIRA issue out
>> https://issues.apache.org/jira/browse/STANBOL-804
>>
>> Cheers
>> Andrea
>>
>>
>> 2013/3/20 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>>
>>> Hi Andrea,
>>>
>>> there used to be encoding problems, but I think they are all fixed
>>> since the 3.8 release. I tried very hard to make TurtleEscaper do the
>>> right thing - I checked the relevant standards etc. Could you give an
>>> example where Jena complains about a DBpedia 3.8 file?
>>>
>>> Cheers,
>>> JC
>>>
>>> On Wed, Mar 20, 2013 at 6:16 PM, Andrea Di Menna <ninn...@gmail.com>
>>> wrote:
>>> > Hi,
>>> >
>>> > I have been using Stanbol [1] to process DBpedia data files and build a
>>> > dbpedia Solr index.
>>> > Stanbol is using Jena TDB in order to load DBpedia files into a triple
>>> > store.
>>> > Unfortunately, almost all the DBpedia N-Triples files must be
>>> > pre-processed
>>> > before being able to import them using Jena [2].
>>> >
>>> > The following sed command is launched:
>>> >
>>> > sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g'
>>> >
>>> > Basically the backslash is replaced with the unicode character escape
>>> > sequence.
>>> >
>>> > Do you think this should/could be fixed in
>>> > org.dbpedia.extraction.util.TurtleEscaper#escapeTurtle ?
>>> >
>>> > Cheers
>>> > Andrea
>>> >
>>> > [1] http://stanbol.apache.org/
>>> > [2] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh
>>> >
>>> >
>>> > ------------------------------------------------------------------------------
>>> > Everyone hates slow websites. So do we.
>>> > Make your web apps faster with AppDynamics
>>> > Download AppDynamics Lite for free today:
>>> > http://p.sf.net/sfu/appdyn_d2d_mar
>>> > _______________________________________________
>>> > Dbpedia-discussion mailing list
>>> > Dbpedia-discussion@lists.sourceforge.net
>>> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion