Re: [Dbpedia-discussion] Backslash encoding for URIs

Andrea Di Menna Thu, 21 Mar 2013 02:22:20 -0700

Hi Jona,

I compressed the nt file with plain bzip2 (not pbzip2):


andread@build04:~/tools/apache-jena-2.7.4/bin$ bzip2 --version
bzip2, a block-sorting file compressor.  Version 1.0.6, 6-Sept-2010.

   Copyright (C) 1996-2010 by Julian Seward.

   This program is free software; you can redistribute it and/or modify
   it under the terms set out in the LICENSE file, which is included
   in the bzip2-1.0.6 source distribution.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   LICENSE file for more details.

Also, I now tried with the same file mentioned in the JIRA bug [1], using
both Jena 2.7.4 and 2.10.0 tdbloader2, and got the following:

1) Same exception as below when running on bz2 file
2) No exception with uncompressed nt file
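
To separate a data-encoding problem from a decompression problem, the
decompressed N-Triples can be checked for valid UTF-8 outside of Jena. Below
is a minimal sketch in Python 3. It is not taken from the Jena or Stanbol
code, and the function names are made up for illustration.

```python
import bz2


def first_bad_line(lines):
    """Return (line_number, byte_offset) of the first line that is not
    valid UTF-8, or None if every line decodes cleanly.

    `lines` is any iterable of bytes objects, one per line.
    """
    for lineno, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")  # strict decoding, raises on malformed input
        except UnicodeDecodeError as err:
            return lineno, err.start
    return None


def check_bz2_file(path):
    """Decompress `path` on the fly and scan it line by line.

    bz2.open reads multi-stream files in Python 3.3 and later.
    """
    with bz2.open(path, "rb") as fh:
        return first_bad_line(fh)
```

If check_bz2_file() returns None for the .bz2 file, the bytes themselves are
fine and the problem is more likely in how the loader reads compressed input.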

But I remember seeing the same exceptions as the ones in the JIRA issue
when using the Stanbol indexing tool (which builds a TDB store from the
source RDF files before building the Solr index).
It is likely, then, that the Stanbol code does not process RDF files the
same way tdbloader2 does.

WDYT?

Cheers
Andrea

[1] http://downloads.dbpedia.org/3.8/es/redirects_es.nt.bz2

2013/3/21 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>

> On 20 March 2013 20:10, Andrea Di Menna <ninn...@gmail.com> wrote:
> > Hi Jona,
> >
> > I have tried loading labels_en_uris_de.nt.bz2 from the DBpedia 3.8 release
> > using both Jena 2.7.4 and 2.10.0, but both fail with the following error:
> >
> > andread@build04:~/tools/apache-jena-2.10.0/bin$ ./tdbloader2 --loc . /media/HD2/data/dbpedia-3.8-archive/source_data/labels_en_uris_de.nt.bz2
> >  19:48:02 -- TDB Bulk Loader Start
> >  19:48:02 Data phase
> > INFO  Load: /media/HD2/data/dbpedia-3.8-archive/source_data/labels_en_uris_de.nt.bz2 -- 2013/03/20 19:48:03 CET
> > Exception in thread "main" org.apache.jena.atlas.AtlasException: java.nio.charset.MalformedInputException: Input length = 1
> >     at org.apache.jena.atlas.io.IO.exception(IO.java:154)
> >     at org.apache.jena.atlas.io.CharStreamBuffered$SourceReader.fill(CharStreamBuffered.java:79)
> >     at org.apache.jena.atlas.io.CharStreamBuffered.fillArray(CharStreamBuffered.java:156)
> >     at org.apache.jena.atlas.io.CharStreamBuffered.advance(CharStreamBuffered.java:139)
> >     at org.apache.jena.atlas.io.PeekReader.advanceAndSet(PeekReader.java:251)
> >     at org.apache.jena.atlas.io.PeekReader.init(PeekReader.java:244)
> >     at org.apache.jena.atlas.io.PeekReader.peekChar(PeekReader.java:169)
> >     at org.apache.jena.atlas.io.PeekReader.makeUTF8(PeekReader.java:108)
> >     at org.apache.jena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:41)
> >     at org.apache.jena.riot.RiotReader.createParser(RiotReader.java:130)
> >     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:115)
> >     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:93)
> >     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:66)
> >     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:162)
> >     at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
> >     at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
> >     at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
> >     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:80)
> > Caused by: java.nio.charset.MalformedInputException: Input length = 1
> >     at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
> >     at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
> >     at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
> >     at java.io.InputStreamReader.read(InputStreamReader.java:184)
> >     at java.io.Reader.read(Reader.java:140)
> >     ... 17 more
> >
> > Anyway, I have now tried the following:
> >
> > 1) Download german labels
> > 2) Run tdbloader2 on the bz2 nt file -> failure
> > 3) Uncompress the bz2 file and run tdbloader2 -> SUCCESS
> > 4) Compress the nt file again -> failure
> >
> > Looks like Jena is having some problems with bz2 files then.
>
> Interesting.
>
> Since 3.8, we use parallel bzip2 [1] to compress the files (it's much
> faster on multi-core machines). The files created by pbzip2 have a
> slightly different format though. Legal for bzip2, but for example
> older versions of Commons Compress cannot deal with it [2][3].
>
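
As far as I understand, the format difference is that pbzip2 compresses
blocks independently and writes them as several concatenated bzip2 streams,
whereas plain bzip2 writes a single stream. A rough Python 3 sketch to count
the streams in a file follows. The function name is hypothetical, and the
sketch holds the whole input in memory, so it is only meant for small test
files.

```python
import bz2


def count_bz2_streams(data: bytes) -> int:
    """Count the concatenated bzip2 streams contained in `data`."""
    streams = 0
    while data:
        decomp = bz2.BZ2Decompressor()  # one decompressor handles one stream
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated bzip2 stream")
        streams += 1
        data = decomp.unused_data  # bytes left over after the stream's end
    return streams
```

It could be called as count_bz2_streams(open(path, "rb").read()). A result
greater than 1 means a reader that stops after the first stream would only
see part of the data.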
> > 2) Run tdbloader2 on the bz2 nt file -> failure
> > 3) Uncompress the bz2 file and run tdbloader2 -> SUCCESS
>
> This very much looks like compression is the culprit, not DBpedia encoding.
>
> > 4) Compress the nt file again -> failure
>
> This is a bit weird. How do you compress the file?
>

> Cheers,
> JC
>
> [1] http://compression.ca/pbzip2/
> [2] https://issues.apache.org/jira/browse/COMPRESS-146
> [3] https://issues.apache.org/jira/browse/COMPRESS-162
>
> > Would you mind giving it a try?
> >
> > But anyway please check this JIRA issue out
> > https://issues.apache.org/jira/browse/STANBOL-804
> >
> > Cheers
> > Andrea
> >
> >
> > 2013/3/20 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
> >>
> >> Hi Andrea,
> >>
> >> there used to be encoding problems, but I think they are all fixed
> >> since the 3.8 release. I tried very hard to make TurtleEscaper do the
> >> right thing - I checked the relevant standards etc. Could you give an
> >> example where Jena complains about a DBpedia 3.8 file?
> >>
> >> Cheers,
> >> JC
> >>
> >> On Wed, Mar 20, 2013 at 6:16 PM, Andrea Di Menna <ninn...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I have been using Stanbol [1] to process DBpedia data files and build a
> >> > DBpedia Solr index.
> >> > Stanbol uses Jena TDB to load DBpedia files into a triple store.
> >> > Unfortunately, almost all the DBpedia N-Triples files must be
> >> > pre-processed before they can be imported using Jena [2].
> >> >
> >> > The following sed command is launched:
> >> >
> >> > sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g'
> >> >
> >> > Basically, every backslash that does not start a \u or \" escape is
> >> > replaced with its Unicode escape sequence (\u005c).
> >> >
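
For readers who find the sed quoting hard to follow, the two substitutions
can be sketched in Python 3 as below. This is only an illustration of what
the command above does, not code from Stanbol or DBpedia, and the function
name is made up.

```python
import re


def escape_backslashes(line: str) -> str:
    """Apply the two sed substitutions to one line (without its newline)."""
    # 1) Every pair of backslashes becomes \u005c\u005c.
    line = re.sub(r"\\\\", r"\\u005c\\u005c", line)
    # 2) Any remaining backslash not followed by 'u' or '"' becomes \u005c.
    #    '\n' is excluded because sed never sees the trailing newline.
    line = re.sub(r'\\([^u"\n])', r"\\u005c\1", line)
    return line
```

Like the sed original, the second rule also rewrites escapes such as \t or
\U, which may or may not be what is wanted.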
> >> > Do you think this should/could be fixed in
> >> > org.dbpedia.extraction.util.TurtleEscaper#escapeTurtle ?
> >> >
> >> > Cheers
> >> > Andrea
> >> >
> >> > [1] http://stanbol.apache.org/
> >> > [2] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh
> >> >
> >> >
> >> >
> ------------------------------------------------------------------------------
> >> > Everyone hates slow websites. So do we.
> >> > Make your web apps faster with AppDynamics
> >> > Download AppDynamics Lite for free today:
> >> > http://p.sf.net/sfu/appdyn_d2d_mar
> >> > _______________________________________________
> >> > Dbpedia-discussion mailing list
> >> > Dbpedia-discussion@lists.sourceforge.net
> >> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
> >> >
> >
> >
>
