Re: [Dbpedia-discussion] Backslash encoding for URIs

Andrea Di Menna Thu, 21 Mar 2013 08:26:17 -0700

Hi Jona,

Rupert is great!


I am now building revision 1459296. Will let you know.

Cheers
Andrea

2013/3/21 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>

> Hi Andrea,
>
> Rupert committed a change: http://svn.apache.org/r1459296 (Wow! Fast
> response!)
>
> Andrea: maybe you could try the SVN version and tell us if it works?
>
> Cheers,
> JC
>
> On 21 March 2013 12:08, Jona Christopher Sahnwaldt <j...@sahnwaldt.de> wrote:
> > Hi Andrea, Rupert,
> >
> > Rupert, maybe you can help. Summary: DBpedia backslash escaping is
> > (most likely) correct since 3.8. Stanbol / Jena can read the DBpedia
> > 3.8 files fine if they are uncompressed first. It looks like Stanbol
> > has a problem with bz2.
> >
> > https://issues.apache.org/jira/browse/STANBOL-804
> > http://markmail.org/message/67ivlyoxfqad6xoe
> >
> > Cheers,
> > JC
> >
> > On 21 March 2013 10:20, Andrea Di Menna <ninn...@gmail.com> wrote:
> >> Hi Jona,
> >>
> >> I compressed the nt file with bzip2
> >>
> >> andread@build04:~/tools/apache-jena-2.7.4/bin$ bzip2 --version
> >> bzip2, a block-sorting file compressor.  Version 1.0.6, 6-Sept-2010.
> >>
> >>    Copyright (C) 1996-2010 by Julian Seward.
> >>
> >>    This program is free software; you can redistribute it and/or modify
> >>    it under the terms set out in the LICENSE file, which is included
> >>    in the bzip2-1.0.6 source distribution.
> >>
> >>    This program is distributed in the hope that it will be useful,
> >>    but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >>    LICENSE file for more details.
> >>
> >> Also, I now tried with the same file mentioned in the JIRA bug [1], using
> >> both Jena 2.7.4 and 2.10.0 tdbloader2, and got the following:
> >>
> >> 1) Same exception as below when running on bz2 file
> >> 2) No exception with uncompressed nt file
> >>
> >> But I remember seeing the same exceptions as the ones in the JIRA issue when
> >> using the Stanbol indexing tool (which builds a TDB from the source RDF files
> >> before building the Solr index).
> >> It is likely, then, that the Stanbol code does not behave like tdbloader2 when
> >> processing RDF files.
> >>
> >> WDYT?
> >>
> >> Cheers
> >> Andrea
> >>
> >> [1] http://downloads.dbpedia.org/3.8/es/redirects_es.nt.bz2
> >>
> >>
> >> 2013/3/21 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
> >>>
> >>> On 20 March 2013 20:10, Andrea Di Menna <ninn...@gmail.com> wrote:
> >>> > Hi Jona,
> >>> >
> >>> > I have tried loading labels_en_uris_de.nt.bz2 from the DBpedia 3.8 release
> >>> > using both Jena 2.7.4 and 2.10.0, but both fail with the following error:
> >>> >
> >>> > andread@build04:~/tools/apache-jena-2.10.0/bin$ ./tdbloader2 --loc . /media/HD2/data/dbpedia-3.8-archive/source_data/labels_en_uris_de.nt.bz2
> >>> >  19:48:02 -- TDB Bulk Loader Start
> >>> >  19:48:02 Data phase
> >>> > INFO  Load: /media/HD2/data/dbpedia-3.8-archive/source_data/labels_en_uris_de.nt.bz2 -- 2013/03/20 19:48:03 CET
> >>> > Exception in thread "main" org.apache.jena.atlas.AtlasException: java.nio.charset.MalformedInputException: Input length = 1
> >>> >     at org.apache.jena.atlas.io.IO.exception(IO.java:154)
> >>> >     at org.apache.jena.atlas.io.CharStreamBuffered$SourceReader.fill(CharStreamBuffered.java:79)
> >>> >     at org.apache.jena.atlas.io.CharStreamBuffered.fillArray(CharStreamBuffered.java:156)
> >>> >     at org.apache.jena.atlas.io.CharStreamBuffered.advance(CharStreamBuffered.java:139)
> >>> >     at org.apache.jena.atlas.io.PeekReader.advanceAndSet(PeekReader.java:251)
> >>> >     at org.apache.jena.atlas.io.PeekReader.init(PeekReader.java:244)
> >>> >     at org.apache.jena.atlas.io.PeekReader.peekChar(PeekReader.java:169)
> >>> >     at org.apache.jena.atlas.io.PeekReader.makeUTF8(PeekReader.java:108)
> >>> >     at org.apache.jena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:41)
> >>> >     at org.apache.jena.riot.RiotReader.createParser(RiotReader.java:130)
> >>> >     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:115)
> >>> >     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:93)
> >>> >     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:66)
> >>> >     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:162)
> >>> >     at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
> >>> >     at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
> >>> >     at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
> >>> >     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:80)
> >>> > Caused by: java.nio.charset.MalformedInputException: Input length = 1
> >>> >     at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
> >>> >     at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
> >>> >     at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
> >>> >     at java.io.InputStreamReader.read(InputStreamReader.java:184)
> >>> >     at java.io.Reader.read(Reader.java:140)
> >>> >     ... 17 more
> >>> >
> >>> > Anyway, I have now tried the following:
> >>> >
> >>> > 1) Download german labels
> >>> > 2) Run tdbloader2 on the bz2 nt file -> failure
> >>> > 3) Uncompress the bz2 file and run tdbloader2 -> SUCCESS
> >>> > 4) Compress the nt file again -> failure
> >>> >
> >>> > Looks like Jena is having some problems with bz2 files then.
> >>>
> >>> Interesting.
> >>>
> >>> Since 3.8, we use parallel bzip2 [1] to compress the files (it's much
> >>> faster on multi-core machines). The files created by pbzip2 have a
> >>> slightly different format though. Legal for bzip2, but for example
> >>> older versions of Commons Compress cannot deal with it [2][3].
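
As a rough illustration of what "a slightly different format" can mean in practice: assuming pbzip2 output is a concatenation of several independent bzip2 streams, split at byte offsets rather than character boundaries (an assumption, not checked here against the DBpedia files), a reader that stops at the first end-of-stream marker only sees a prefix of the data, and that prefix may end in the middle of a multi-byte UTF-8 character. The Python sketch below builds such a two-stream input in memory; the helper name first_stream_only and the sample text are made up for the example.

```python
import bz2

# Sample text with a two-byte UTF-8 character (u-umlaut) near the start.
text = "M\u00fcnchen\n"
raw = text.encode("utf-8")

# Assumption: emulate a multi-stream file by compressing two byte ranges
# independently and concatenating them. The split at offset 2 falls
# between the two bytes of the u-umlaut.
two_streams = bz2.compress(raw[:2]) + bz2.compress(raw[2:])


def first_stream_only(data: bytes) -> bytes:
    """Decode like a reader that stops at the first end-of-stream marker."""
    decompressor = bz2.BZ2Decompressor()
    # BZ2Decompressor handles a single stream; the bytes of any following
    # stream are left in decompressor.unused_data.
    return decompressor.decompress(data)


# A multi-stream-aware reader recovers the full text.
full = bz2.decompress(two_streams).decode("utf-8")

# A single-stream reader gets a truncated prefix that is not valid UTF-8.
prefix = first_stream_only(two_streams)
try:
    prefix.decode("utf-8")
    prefix_is_valid_utf8 = True
except UnicodeDecodeError:
    prefix_is_valid_utf8 = False

print(repr(full), repr(prefix), prefix_is_valid_utf8)
```

This is only a sketch of the class of problem described above, not a confirmed diagnosis of the tdbloader2 exception.
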
> >>>
> >>> > 2) Run tdbloader2 on the bz2 nt file -> failure
> >>> > 3) Uncompress the bz2 file and run tdbloader2 -> SUCCESS
> >>>
> >>> This very much looks like compression is the culprit, not DBpedia
> >>> encoding.
> >>>
> >>> > 4) Compress the nt file again -> failure
> >>>
> >>> This is a bit weird. How do you compress the file?
> >>>
> >>>
> >>> Cheers,
> >>> JC
> >>>
> >>> [1] http://compression.ca/pbzip2/
> >>> [2] https://issues.apache.org/jira/browse/COMPRESS-146
> >>> [3] https://issues.apache.org/jira/browse/COMPRESS-162
> >>>
> >>> > Would you mind giving it a try?
> >>> >
> >>> > But anyway please check this JIRA issue out
> >>> > https://issues.apache.org/jira/browse/STANBOL-804
> >>> >
> >>> > Cheers
> >>> > Andrea
> >>> >
> >>> >
> >>> > 2013/3/20 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
> >>> >>
> >>> >> Hi Andrea,
> >>> >>
> >>> >> there used to be encoding problems, but I think they are all fixed
> >>> >> since the 3.8 release. I tried very hard to make TurtleEscaper do the
> >>> >> right thing - I checked the relevant standards etc. Could you give an
> >>> >> example where Jena complains about a DBpedia 3.8 file?
> >>> >>
> >>> >> Cheers,
> >>> >> JC
> >>> >>
> >>> >> On Wed, Mar 20, 2013 at 6:16 PM, Andrea Di Menna <ninn...@gmail.com> wrote:
> >>> >> > Hi,
> >>> >> >
> >>> >> > I have been using Stanbol [1] to process DBpedia data files and build a
> >>> >> > DBpedia Solr index.
> >>> >> > Stanbol uses Jena TDB to load DBpedia files into a triple store.
> >>> >> > Unfortunately, almost all the DBpedia N-Triples files must be pre-processed
> >>> >> > before they can be imported using Jena [2].
> >>> >> >
> >>> >> > The following sed command is launched:
> >>> >> >
> >>> >> > sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g'
> >>> >> >
> >>> >> > Basically, the backslash is replaced with its Unicode character escape
> >>> >> > sequence.
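
For readers who want to see what the two substitutions do, here is a rough Python sketch that mirrors the sed command quoted above, applied to one line at a time. The function name escape_backslashes is made up for the example, and the behaviour has only been reasoned through against the two sed expressions, not run against the DBpedia dumps.

```python
import re

# The six-character escape sequence \u005c (backslash, 'u', '0', '0', '5', 'c').
ESCAPED = "\\u005c"


def escape_backslashes(line: str) -> str:
    """Mirror of: sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g'"""
    # First expression: every pair of backslashes becomes two \u005c escapes.
    line = line.replace("\\\\", ESCAPED + ESCAPED)
    # Second expression: a backslash followed by any character other than
    # 'u' or '"' becomes \u005c followed by that character. The newline is
    # excluded because sed works line by line and never sees it.
    return re.sub(r'\\([^u"\n])', lambda m: ESCAPED + m.group(1), line)


if __name__ == "__main__":
    # Hypothetical sample literal, not taken from a DBpedia file.
    print(escape_backslashes('"C:\\\\temp and a \\"quote\\""'))
```

Backslashes that introduce \u escapes or an escaped double quote are left alone, which is why the \u005c sequences produced by the first step are not touched by the second.
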
> >>> >> >
> >>> >> > Do you think this should/could be fixed in
> >>> >> > org.dbpedia.extraction.util.TurtleEscaper#escapeTurtle ?
> >>> >> >
> >>> >> > Cheers
> >>> >> > Andrea
> >>> >> >
> >>> >> > [1] http://stanbol.apache.org/
> >>> >> > [2] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh
> >>> >> >
> >>> >> > ------------------------------------------------------------------------------
> >>> >> > Everyone hates slow websites. So do we.
> >>> >> > Make your web apps faster with AppDynamics
> >>> >> > Download AppDynamics Lite for free today:
> >>> >> > http://p.sf.net/sfu/appdyn_d2d_mar
> >>> >> > _______________________________________________
> >>> >> > Dbpedia-discussion mailing list
> >>> >> > Dbpedia-discussion@lists.sourceforge.net
> >>> >> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
> >>> >> >
> >>> >
> >>> >
> >>
> >>
>

