On 22/02/13 08:18, Егор Егоров wrote:
Please help me with the new tdb 0.9.4.

I am trying to load the BTC dataset from the SemSearch-2011 challenge
(http://semsearch.yahoo.com) into storage via tdb 0.9.4 (jena 2.7.4).

I got the following:

egor@egorov:~/semsearch-2011/dataset$ tdbloader2 --loc ../tdb btc-2009-chunk-010-urified.gz

   12:00:43 -- TDB Bulk Loader Start
   12:00:43 Data phase
INFO  Load: btc-2009-chunk-010-urified.gz -- 2013/02/22 12:00:45 MSK
INFO  Add: 50 000 Data (Batch: 23 529 / Avg: 23 529)
...
INFO  Add: 1 750 000 Data (Batch: 103 519 / Avg: 76 509)
ERROR [line: 1777296, col: 106] Bad language tag
Exception in thread "main" org.openjena.riot.RiotException: [line: 1777296, col: 106] Bad language tag

The strange language tag is @"18".  But I need to load the entire dataset
and skip errors of this type.

As a general principle for bulk loading, it is useful to check the data before starting the load. You can do this with "riot --validate".

Then you can clean the data up. Text processing with sed or perl is one way to do this. The correct fix depends on the data and the use you intend to make of it.
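As a rough illustration of that kind of clean-up, here is a minimal Python sketch that drops any line whose literal carries a malformed language tag. It is only one possible fix, and it rests on assumptions: the data is line-based (N-Triples or N-Quads, one statement per line), a valid tag has the shape letters followed by optional "-alphanumeric" parts, and each tag is followed by whitespace. The names has_valid_language_tags and clean_lines are made up for this example and are not part of Jena or TDB.

```python
import gzip
import re
import sys

# A quoted literal, optionally followed by "@" and whatever token comes next.
# Assumes the tag is followed by whitespace, as is usual in N-Triples/N-Quads.
LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"(?:@(\S*))?')

# Shape of a well-formed language tag: letters, then optional "-alphanumeric" parts.
LANG_TAG = re.compile(r'[A-Za-z]+(?:-[A-Za-z0-9]+)*')


def has_valid_language_tags(line):
    """Return False if any literal on the line has a malformed language tag."""
    for match in LITERAL.finditer(line):
        tag = match.group(1)
        if tag is not None and not LANG_TAG.fullmatch(tag):
            return False
    return True


def clean_lines(lines, rejected=None):
    """Yield lines that pass the check; record (line number, line) for the rest."""
    for number, line in enumerate(lines, start=1):
        if has_valid_language_tags(line):
            yield line
        elif rejected is not None:
            rejected.append((number, line))


# Hypothetical usage: python clean.py input.gz output.gz
if __name__ == "__main__" and len(sys.argv) == 3:
    skipped = []
    with gzip.open(sys.argv[1], "rt", encoding="utf-8") as src, \
         gzip.open(sys.argv[2], "wt", encoding="utf-8") as dst:
        dst.writelines(clean_lines(src, skipped))
    for number, line in skipped:
        print(f"skipped line {number}: {line.rstrip()}", file=sys.stderr)
```

The skipped lines are reported rather than silently discarded, so you can look at them and decide whether dropping the statement, or repairing it, suits your data. It is still worth validating the cleaned file again before loading it.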

Letting bad data into the database can cause trouble later. Other tools may assume that whatever the database outputs is syntactically valid.

        Andy


How can I make tdbloader ignore this type of error?

Thank You.

