Re: Need help in loading a large RDF file into JENA TDB using TDBLoader

Andy Seaborne Fri, 28 Dec 2012 06:07:23 -0800

On 28/12/12 13:26, Abhishek Shivkumar wrote:

Thanks Andy. Awesome!


So, I am downloading the latest dump of Freebase RDF ->
freebase-rdf-2012-12-23-00-00.gz

Let me check with that and use tdbloader to see if it has been corrected.

Also, when will JENA 2.10.0, with this correction, be released?


Development builds are always available:

https://repository.apache.org/content/repositories/snapshots/org/apache/jena/

We don't have a date for the release but

(1) the development builds should pass all the tests - the CI Jenkins installation will tell the dev mailing list if not!

(2) expect a test/release cycle for 2.10.0 in which we ask the community to test things before a release, and not after, so that will take a while.

So while not as highly tested, the dev builds are generally pretty good and contain no known issues that would block a release.


I wrote to freebase-discuss about the "13." issue, but running

sed -e 's/\.$/ ./'

might be a good idea.
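For anyone who would rather do the same clean-up outside sed, here is a minimal Python sketch of that substitution. The function name space_before_final_dot is made up for illustration; the regular expression mirrors the sed pattern above. Like the sed command, it also adds a space to lines that already end in " .", which is harmless.

```python
import re

# A '.' that is the last character of the line (a final newline is allowed),
# mirroring the sed pattern \.$
TRAILING_DOT = re.compile(r"\.$")


def space_before_final_dot(line: str) -> str:
    """Return the line with its final '.' replaced by ' .'.

    Lines that do not end in '.' are returned unchanged.
    """
    return TRAILING_DOT.sub(" .", line)
```

Applied line by line to the dump, this turns a line ending in "13." into one ending in "13 .", so the integer and the terminating DOT become separate tokens.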

To load TDB, you're going to need a big RAM machine and patience.

So checking the data first, despite the delay, is going to be a good idea.

tdbloader2 is likely to be faster. There are some tuning parameters as well - see the script for details. Adding --parallel=3 is good if your sort(1) supports it.

I'll download the 32-12 data but I don't have access to a large machine at the moment. I'm going to run "split -l 10000000" to make finding errors easier.
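Where split(1) is not to hand, the same chunking can be sketched in Python. This is only an illustration: the function name split_by_lines and the numbered output names (prefix plus a four-digit counter) are assumptions, not what split itself produces, and the input is assumed to be already uncompressed.

```python
from typing import List


def split_by_lines(src_path: str, lines_per_chunk: int, out_prefix: str) -> List[str]:
    """Copy src_path into consecutive files of at most lines_per_chunk lines.

    Returns the list of chunk file paths, in order. Lines are streamed,
    so only one line is held in memory at a time.
    """
    if lines_per_chunk <= 0:
        raise ValueError("lines_per_chunk must be positive")

    out_paths: List[str] = []
    out = None
    try:
        with open(src_path, "rb") as src:
            for index, line in enumerate(src):
                # Start a new chunk file every lines_per_chunk lines.
                if index % lines_per_chunk == 0:
                    if out is not None:
                        out.close()
                    path = f"{out_prefix}{len(out_paths):04d}"
                    out = open(path, "wb")
                    out_paths.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return out_paths
```

Calling it with lines_per_chunk=10_000_000 gives chunks of the size mentioned above; a parse error reported in one chunk then narrows the search to that file.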


        Andy


Thank you!

With Regards,
Abhishek S


On Fri, Dec 28, 2012 at 6:32 PM, Andy Seaborne wrote:

    On 28/12/12 07:42, Abhishek Shivkumar wrote:

        Hi Andy,

            Here are the triples from the neighborhood of line 270608. I
        tried finding the error but couldn't. Do you see any by chance?
        I printed the line number too on the left just in case.  Ex:
        "line num 270591-"


    Not quite the right line but close ... this may be the problem:

    Line:
    -----------------

    ns:m.01gqn1
    ns:base.braziliangovt.__brazilian_political_party.__number      13.
    -----------------

    and the problem is the   13.

    The WG spec in development has:

    [21]    DECIMAL         ::=     [+-]? [0-9]* '.' [0-9]+

    so a decimal must have a trailing digit, and "13." is integer 13
    followed by a DOT (terminates the triples).

    But the W3C submission has a known problem in this area:

    [18]    decimal         ::=     ('-' | '+')? ( [0-9]+ '.' [0-9]* |
    '.' ([0-9])+ | ([0-9])+ )

    and 13. is ambiguous.  Is it 13 and a DOT, or a decimal with lexical
    form "13."?  The normal way to tokenize is to choose the longest
    match (so ":abc" isn't ":a" then "bc") and that means you need a
    space to separate the tokens '13' and DOT.

    Jena 2.7.4 follows the submission, so "13." is a decimal and the
    triple then still needs a trailing DOT.

    In fact, using space-DOT everywhere would be very sensible.
      Trailing dots on prefix names may confuse some older parsers.

    Jena development (2.10.0) follows the W3C WG spec, so it's integer
    13 followed by a trailing DOT, and the line parses.
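The difference between the two grammars can be illustrated with a small longest-match tokenizer in Python. This is a toy sketch and not Jena's tokenizer: it only knows numbers and DOT, the INTEGER pattern and the names tokenize, WG_RULES and SUBMISSION_RULES are assumptions made for the example, and the two DECIMAL patterns are transcribed from productions [21] and [18] quoted above.

```python
import re

INTEGER = r"[+-]?[0-9]+"
# [21] from the WG spec: at least one digit is required after the '.'
WG_DECIMAL = r"[+-]?[0-9]*\.[0-9]+"
# [18] from the submission: digits after the '.' are optional
SUBMISSION_DECIMAL = r"[+-]?(?:[0-9]+\.[0-9]*|\.[0-9]+|[0-9]+)"

WG_RULES = [("INTEGER", INTEGER), ("DECIMAL", WG_DECIMAL)]
SUBMISSION_RULES = [("INTEGER", INTEGER), ("DECIMAL", SUBMISSION_DECIMAL)]


def tokenize(text, numeric_rules):
    """Split text into (kind, lexeme) pairs, always taking the longest match.

    On a tie in length, the rule listed first wins. Whitespace separates
    tokens and is otherwise ignored.
    """
    rules = [(name, re.compile(pattern)) for name, pattern in numeric_rules]
    rules.append(("DOT", re.compile(r"\.")))
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None  # (kind, end offset) of the longest match so far
        for name, regex in rules:
            match = regex.match(text, pos)
            if match and (best is None or match.end() > best[1]):
                best = (name, match.end())
        if best is None:
            raise ValueError(f"no token matches at offset {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens
```

With WG_RULES, "13." splits into an INTEGER "13" and a DOT. With SUBMISSION_RULES, the longest match swallows the dot into a single DECIMAL "13." and no DOT is left to end the triple, whereas "13 ." yields the number and a separate DOT under both rule sets.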

    Do you have a corrected version of freebase-rdf-2012-12-09-00-00?  I
    downloaded it but there are other things to fix up before it gets to
    that point.

             Andy
