I suspect multistream bzip2 is the culprit, which would fit with the
switch to parallel bzip2 (pbzip2).

For what it's worth, Python 2.x can't read these files either.  There's
a backport of the 3.x support, but it requires installing a separate
package.
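
To illustrate what I mean by multistream, here is a minimal Python 3
sketch. It assumes the files are simply several bzip2 streams
concatenated together, which is my suspicion and not something I have
verified against the DBpedia dumps. The sample triples and variable
names are made up for the example. It shows a single-stream reader
stopping after the first stream while a multistream-aware reader
returns everything; it says nothing about what Jena does internally.

```python
import bz2

# Two independently compressed chunks, concatenated. This mimics the
# assumed multistream layout; the triples are invented sample data.
part1 = b'<http://example.org/a> <http://example.org/p> "one" .\n'
part2 = b'<http://example.org/b> <http://example.org/p> "two" .\n'
multistream = bz2.compress(part1) + bz2.compress(part2)

# BZ2Decompressor handles exactly one stream. Anything after the end
# of that stream is left untouched in unused_data.
single = bz2.BZ2Decompressor()
first_only = single.decompress(multistream)
leftover = single.unused_data

# bz2.decompress() accepts concatenated streams in Python 3.3 and
# later, so it returns the contents of both chunks.
everything = bz2.decompress(multistream)

print(len(first_only), len(everything))
```

A reader that behaves like the single-stream case would see only part
of the data, so recompressing with plain bzip2 (one stream) is one way
to test whether multistream is really the problem.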

Tom

On Wed, Mar 20, 2013 at 9:48 PM, Jona Christopher Sahnwaldt
<j...@sahnwaldt.de> wrote:
> On 20 March 2013 20:10, Andrea Di Menna <ninn...@gmail.com> wrote:
>> Hi Jona,
>>
>> I have tried loading labels_en_uris_de.nt.bz2 from the DBpedia 3.8 release
>> using both Jena 2.7.4 and 2.10.0, but both fail with the following error:
>>
>> andread@build04:~/tools/apache-jena-2.10.0/bin$ ./tdbloader2 --loc .
>> /media/HD2/data/dbpedia-3.8-archive/source_data/labels_en_uris_de.nt.bz2
>>  19:48:02 -- TDB Bulk Loader Start
>>  19:48:02 Data phase
>> INFO  Load:
>> /media/HD2/data/dbpedia-3.8-archive/source_data/labels_en_uris_de.nt.bz2 --
>> 2013/03/20 19:48:03 CET
>> Exception in thread "main" org.apache.jena.atlas.AtlasException:
>> java.nio.charset.MalformedInputException: Input length = 1
>>     at org.apache.jena.atlas.io.IO.exception(IO.java:154)
>>     at
>> org.apache.jena.atlas.io.CharStreamBuffered$SourceReader.fill(CharStreamBuffered.java:79)
>>     at
>> org.apache.jena.atlas.io.CharStreamBuffered.fillArray(CharStreamBuffered.java:156)
>>     at
>> org.apache.jena.atlas.io.CharStreamBuffered.advance(CharStreamBuffered.java:139)
>>     at
>> org.apache.jena.atlas.io.PeekReader.advanceAndSet(PeekReader.java:251)
>>     at org.apache.jena.atlas.io.PeekReader.init(PeekReader.java:244)
>>     at org.apache.jena.atlas.io.PeekReader.peekChar(PeekReader.java:169)
>>     at org.apache.jena.atlas.io.PeekReader.makeUTF8(PeekReader.java:108)
>>     at
>> org.apache.jena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:41)
>>     at org.apache.jena.riot.RiotReader.createParser(RiotReader.java:130)
>>     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:115)
>>     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:93)
>>     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:66)
>>     at
>> com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:162)
>>     at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
>>     at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
>>     at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
>>     at
>> com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:80)
>> Caused by: java.nio.charset.MalformedInputException: Input length = 1
>>     at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
>>     at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
>>     at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
>>     at java.io.InputStreamReader.read(InputStreamReader.java:184)
>>     at java.io.Reader.read(Reader.java:140)
>>     ... 17 more
>>
>> Anyway, I have now tried the following:
>>
>> 1) Download german labels
>> 2) Run tdbloader2 on the bz2 nt file -> failure
>> 3) Uncompress the bz2 file and run tdbloader2 -> SUCCESS
>> 4) Compress the nt file again -> failure
>>
>> Looks like Jena is having some problems with bz2 files then.
>
> Interesting.
>
> Since 3.8, we use parallel bzip2 [1] to compress the files (it's much
> faster on multi-core machines). The files created by pbzip2 have a
> slightly different format though. Legal for bzip2, but for example
> older versions of Commons Compress cannot deal with it [2][3].
>
>> 2) Run tdbloader2 on the bz2 nt file -> failure
>> 3) Uncompress the bz2 file and run tdbloader2 -> SUCCESS
>
> This very much looks like compression is the culprit, not DBpedia encoding.
>
>> 4) Compress the nt file again -> failure
>
> This is a bit weird. How do you compress the file?
>
> Cheers,
> JC
>
> [1] http://compression.ca/pbzip2/
> [2] https://issues.apache.org/jira/browse/COMPRESS-146
> [3] https://issues.apache.org/jira/browse/COMPRESS-162
>
>> Would you mind giving it a try?
>>
>> But anyway please check this JIRA issue out
>> https://issues.apache.org/jira/browse/STANBOL-804
>>
>> Cheers
>> Andrea
>>
>>
>> 2013/3/20 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>>
>>> Hi Andrea,
>>>
>>> there used to be encoding problems, but I think they are all fixed
>>> since the 3.8 release. I tried very hard to make TurtleEscaper do the
>>> right thing - I checked the relevant standards etc. Could you give an
>>> example where Jena complains about a DBpedia 3.8 file?
>>>
>>> Cheers,
>>> JC
>>>
>>> On Wed, Mar 20, 2013 at 6:16 PM, Andrea Di Menna <ninn...@gmail.com>
>>> wrote:
>>> > Hi,
>>> >
>>> > I have been using Stanbol [1] to process DBpedia data files and build a
>>> > dbpedia Solr index.
>>> > Stanbol is using Jena TDB in order to load DBpedia files into a triple
>>> > store.
>>> > Unfortunately, almost all the DBpedia N-Triples files must be
>>> > pre-processed
>>> > before being able to import them using Jena [2].
>>> >
>>> > The following sed command is launched:
>>> >
>>> > sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g'
>>> >
>>> > Basically the backslash is replaced with the unicode character escape
>>> > sequence.
>>> >
>>> > Do you think this should/could be fixed in
>>> > org.dbpedia.extraction.util.TurtleEscaper#escapeTurtle ?
>>> >
>>> > Cheers
>>> > Andrea
>>> >
>>> > [1] http://stanbol.apache.org/
>>> > [2]
>>> >
>>> > http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh
>>> >
>>> >
>>> > ------------------------------------------------------------------------------
>>> > Everyone hates slow websites. So do we.
>>> > Make your web apps faster with AppDynamics
>>> > Download AppDynamics Lite for free today:
>>> > http://p.sf.net/sfu/appdyn_d2d_mar
>>> > _______________________________________________
>>> > Dbpedia-discussion mailing list
>>> > Dbpedia-discussion@lists.sourceforge.net
>>> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>> >
>>
>>
>
