Re: [Dbpedia-discussion] Backslash encoding for URIs

Jona Christopher Sahnwaldt Wed, 20 Mar 2013 18:50:14 -0700

On 20 March 2013 20:10, Andrea Di Menna <ninn...@gmail.com> wrote:
> Hi Jona,
>
> I have tried loading labels_en_uris_de.nt.bz2 from the DBpedia 3.8 release
> using both Jena 2.7.4 and 2.10.0, but both fail with the following error:
>
> andread@build04:~/tools/apache-jena-2.10.0/bin$ ./tdbloader2 --loc . /media/HD2/data/dbpedia-3.8-archive/source_data/labels_en_uris_de.nt.bz2
>  19:48:02 -- TDB Bulk Loader Start
>  19:48:02 Data phase
> INFO  Load: /media/HD2/data/dbpedia-3.8-archive/source_data/labels_en_uris_de.nt.bz2 -- 2013/03/20 19:48:03 CET
> Exception in thread "main" org.apache.jena.atlas.AtlasException: java.nio.charset.MalformedInputException: Input length = 1
>     at org.apache.jena.atlas.io.IO.exception(IO.java:154)
>     at org.apache.jena.atlas.io.CharStreamBuffered$SourceReader.fill(CharStreamBuffered.java:79)
>     at org.apache.jena.atlas.io.CharStreamBuffered.fillArray(CharStreamBuffered.java:156)
>     at org.apache.jena.atlas.io.CharStreamBuffered.advance(CharStreamBuffered.java:139)
>     at org.apache.jena.atlas.io.PeekReader.advanceAndSet(PeekReader.java:251)
>     at org.apache.jena.atlas.io.PeekReader.init(PeekReader.java:244)
>     at org.apache.jena.atlas.io.PeekReader.peekChar(PeekReader.java:169)
>     at org.apache.jena.atlas.io.PeekReader.makeUTF8(PeekReader.java:108)
>     at org.apache.jena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:41)
>     at org.apache.jena.riot.RiotReader.createParser(RiotReader.java:130)
>     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:115)
>     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:93)
>     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:66)
>     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:162)
>     at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
>     at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
>     at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
>     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:80)
> Caused by: java.nio.charset.MalformedInputException: Input length = 1
>     at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
>     at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
>     at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
>     at java.io.InputStreamReader.read(InputStreamReader.java:184)
>     at java.io.Reader.read(Reader.java:140)
>     ... 17 more
>
> Anyway, I have now tried the following:
>
> 1) Download the German labels
> 2) Run tdbloader2 on the bz2 nt file -> failure
> 3) Uncompress the bz2 file and run tdbloader2 -> SUCCESS
> 4) Compress the nt file again -> failure
>
> Looks like Jena is having some problems with bz2 files then.


Interesting.

Since 3.8, we use parallel bzip2 [1] to compress the files (it's much
faster on multi-core machines). The files created by pbzip2 have a
slightly different format, though. It is legal bzip2, but older
versions of Commons Compress, for example, cannot deal with it [2][3].
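
To illustrate what "slightly different" can mean in practice: as far as I
know, pbzip2 compresses chunks independently and writes them out as several
concatenated bzip2 streams. A minimal Python sketch of that layout follows.
The sample triples and example.org URIs are made up, and this only shows the
general effect, not what Jena or Commons Compress actually do internally.

```python
import bz2

# Two independently compressed chunks, concatenated into one file body.
# Assumption: this mimics the multi-stream layout a parallel compressor writes.
part_a = b'<http://example.org/a> <http://example.org/p> "eins" .\n'
part_b = b'<http://example.org/b> <http://example.org/p> "zwei" .\n'
multi_stream = bz2.compress(part_a) + bz2.compress(part_b)

# A decoder that only understands a single stream stops at the first
# end-of-stream marker and leaves the rest of the input undecoded.
single = bz2.BZ2Decompressor()
first_only = single.decompress(multi_stream)
leftover = single.unused_data  # raw bytes of the second stream

# bz2.decompress() continues across stream boundaries.
everything = bz2.decompress(multi_stream)

print(len(first_only), len(everything), len(leftover) > 0)
```

A reader that behaves like the single-stream decoder sees only part of the
data, and a cut in the middle of a multi-byte UTF-8 character could plausibly
surface as a decoding error further up the stack.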

> 2) Run tdbloader2 on the bz2 nt file -> failure
> 3) Uncompress the bz2 file and run tdbloader2 -> SUCCESS

This very much looks like compression is the culprit, not DBpedia encoding.

> 4) Compress the nt file again -> failure

This is a bit weird. How do you compress the file?
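
One way to check how a given file was compressed is to count the bzip2
streams it contains. The helper below is only a sketch; count_bz2_streams is
a hypothetical name, not part of any DBpedia or Jena tooling. A result
greater than 1 would mean the file has the multi-stream layout.

```python
import bz2

def count_bz2_streams(path, chunk_size=1 << 20):
    """Count the complete bzip2 streams concatenated in the file at `path`."""
    streams = 0
    decomp = None
    with open(path, "rb") as f:
        data = f.read(chunk_size)
        while data:
            if decomp is None:
                decomp = bz2.BZ2Decompressor()
            decomp.decompress(data)  # decompressed output is discarded
            if decomp.eof:
                # One stream finished; any bytes after it start the next one.
                streams += 1
                data = decomp.unused_data
                decomp = None
                if not data:
                    data = f.read(chunk_size)
            else:
                data = f.read(chunk_size)
    return streams
```

Usage would be something like count_bz2_streams("labels_en_uris_de.nt.bz2").
A file written by plain bzip2 from a single input should give 1.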

Cheers,
JC

[1] http://compression.ca/pbzip2/
[2] https://issues.apache.org/jira/browse/COMPRESS-146
[3] https://issues.apache.org/jira/browse/COMPRESS-162

> Would you mind giving it a try?
>
> But anyway please check this JIRA issue out
> https://issues.apache.org/jira/browse/STANBOL-804
>
> Cheers
> Andrea
>
>
> 2013/3/20 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>
>> Hi Andrea,
>>
>> there used to be encoding problems, but I think they are all fixed
>> since the 3.8 release. I tried very hard to make TurtleEscaper do the
>> right thing - I checked the relevant standards etc. Could you give an
>> example where Jena complains about a DBpedia 3.8 file?
>>
>> Cheers,
>> JC
>>
>> On Wed, Mar 20, 2013 at 6:16 PM, Andrea Di Menna <ninn...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > I have been using Stanbol [1] to process DBpedia data files and build a
>> > DBpedia Solr index.
>> > Stanbol uses Jena TDB to load the DBpedia files into a triple store.
>> > Unfortunately, almost all the DBpedia N-Triples files must be
>> > pre-processed before they can be imported using Jena [2].
>> >
>> > The following sed command is launched:
>> >
>> > sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g'
>> >
>> > Basically, each backslash is replaced with its Unicode escape
>> > sequence (\u005c), except where it starts a \u or \" escape.
>> >
>> > Do you think this should/could be fixed in
>> > org.dbpedia.extraction.util.TurtleEscaper#escapeTurtle ?
>> >
>> > Cheers
>> > Andrea
>> >
>> > [1] http://stanbol.apache.org/
>> > [2] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh
>> >
>> >
>> > ------------------------------------------------------------------------------
>> > Everyone hates slow websites. So do we.
>> > Make your web apps faster with AppDynamics
>> > Download AppDynamics Lite for free today:
>> > http://p.sf.net/sfu/appdyn_d2d_mar
>> > _______________________________________________
>> > Dbpedia-discussion mailing list
>> > Dbpedia-discussion@lists.sourceforge.net
>> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>> >
>
>
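
For readers who do not speak sed: the two substitutions in the command quoted
above can be sketched in Python roughly as follows. escape_backslashes is a
hypothetical name, and the function is meant to be applied to one line at a
time without its trailing newline, which is how sed sees its input.

```python
import re

ESCAPED = "\\u005c"  # the six characters \u005c

def escape_backslashes(line: str) -> str:
    # Step 1: a doubled backslash becomes two \u005c escapes.
    line = re.sub(r"\\\\", lambda m: ESCAPED + ESCAPED, line)
    # Step 2: any remaining backslash is escaped as well, unless the next
    # character is 'u' or '"', so \uXXXX and \" are left alone.
    return re.sub(r'\\([^u"])', lambda m: ESCAPED + m.group(1), line)

print(escape_backslashes('"C:\\\\temp"'))
```

Like the sed command, this also rewrites escapes such as \t or \n into
\u005ct and \u005cn, so it is only a workaround sketch, not a general
N-Triples escaper.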

