Hi Rupert,

I started indexing the Freebase dump 5 days ago, and everything seemed to be
fine when I checked the logs. The following lines are an excerpt from the
logs:


11:32:58,216 [Thread-3] INFO  jenatdb.RdfResourceImporter - Filtered:
1423200000 triples (80.25462627530949%)
11:33:27,960 [Thread-3] WARN  jena.riot - [line: 1773380405, col: 129]
Lexical form 'T17:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:33:27,960 [Thread-3] WARN  jena.riot - [line: 1773380406, col: 126]
Lexical form 'T10:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:33:27,961 [Thread-3] WARN  jena.riot - [line: 1773380407, col: 126]
Lexical form 'T17:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:33:37,043 [Thread-3] WARN  jena.riot - [line: 1773388185, col: 123]
Lexical form 'T00:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:33:37,043 [Thread-3] WARN  jena.riot - [line: 1773388186, col: 126]
Lexical form 'T00:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:35:08,676 [Thread-3] WARN  jena.riot - [line: 1773466822, col: 125]
Lexical form 'T08:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:35:08,676 [Thread-3] WARN  jena.riot - [line: 1773466823, col: 127]
Lexical form 'T16:30' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:35:08,676 [Thread-3] WARN  jena.riot - [line: 1773466825, col: 128]
Lexical form 'T13:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:35:12,191 [Thread-3] INFO  jenatdb.RdfResourceImporter - Add:
350,200,000 triples (Batch: 333 / Avg: 702)
11:36:16,585 [Thread-3] WARN  jena.riot - [line: 1773516793, col: 125]
Lexical form 'T08:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:36:16,585 [Thread-3] WARN  jena.riot - [line: 1773516797, col: 127]
Lexical form 'T19:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:36:16,585 [Thread-3] WARN  jena.riot - [line: 1773516801, col: 126]
Lexical form 'T08:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:36:16,585 [Thread-3] WARN  jena.riot - [line: 1773516803, col: 128]
Lexical form 'T18:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:36:16,586 [Thread-3] WARN  jena.riot - [line: 1773516806, col: 129]
Lexical form 'T19:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:36:16,586 [Thread-3] WARN  jena.riot - [line: 1773516807, col: 123]
Lexical form 'T08:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:36:16,586 [Thread-3] WARN  jena.riot - [line: 1773516808, col: 125]
Lexical form 'T09:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:36:16,586 [Thread-3] WARN  jena.riot - [line: 1773516809, col: 123]
Lexical form 'T08:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:36:22,545 [Thread-3] INFO  jenatdb.RdfResourceImporter - Filtered:
1423300000 triples (80.25274111404639%)
11:37:41,460 [Thread-3] INFO  jenatdb.RdfResourceImporter - Add:
350,250,000 triples (Batch: 334 / Avg: 702)
11:39:29,824 [Thread-3] DEBUG file.BlockAccessMapped - Segment: 706
11:39:38,981 [Thread-3] INFO  jenatdb.RdfResourceImporter - Filtered:
1423400000 triples (80.25097751824737%)
11:40:14,344 [Thread-3] INFO  jenatdb.RdfResourceImporter - Add:
350,300,000 triples (Batch: 327 / Avg: 702)
11:40:23,581 [Thread-3] DEBUG file.BlockAccessMapped - Segment: 1930
11:41:55,656 [Thread-3] INFO  jenatdb.RdfResourceImporter - Add:
350,350,000 triples (Batch: 493 / Avg: 702)
11:41:58,858 [Thread-3] INFO  jenatdb.RdfResourceImporter - Filtered:
1423500000 triples (80.2491098335781%)
11:42:04,305 [Thread-3] WARN  jena.riot - [line: 1773858075, col: 98]
Lexical form 'T15:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:42:04,325 [Thread-3] WARN  jena.riot - [line: 1773858119, col: 100]
Lexical form 'T18:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:42:04,364 [Thread-3] WARN  jena.riot - [line: 1773858168, col: 98]
Lexical form 'T13:30' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:42:04,364 [Thread-3] WARN  jena.riot - [line: 1773858169, col: 100]
Lexical form 'T00:00' not valid for datatype
http://www.w3.org/2001/XMLSchema#dateTime
11:43:46,111 [Thread-3] WARN  jena.riot - [line: 1773975044, col: 88] Bad
IRI: <http://www.amazon.de:80/exec/obidos/ASIN/B00005V6S1> Code:
13/DEFAULT_PORT_SHOULD_BE_OMITTED in PORT: If the port is the default one
for the scheme it should be omitted.
11:43:46,111 [Thread-3] WARN  jena.riot - [line: 1773975044, col: 88] Bad
IRI: <http://www.amazon.de:80/exec/obidos/ASIN/B00005V6S1> Code:
14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be
accessed using the appropriate scheme name.


However, I noticed that the indexing process uses up to 14 GB of RAM and
very little CPU (0% to 1%, mostly 0%). It also does not seem to use any
disk space at all. Is this something to be worried about?

Thanks
Amindri



On 13 February 2015 at 17:16, Amindri Udugala <amindriudug...@gmail.com>
wrote:

> Hi Rupert,
>
> The fix is in the indexing tool
> (entityhub/indexing/core/source/LineBasedEntityIterator.java). I created
> the issue and submitted the patch.
>
> Yes Rupert, the problem was that Jena TDB was not importing the Freebase
> dump. The reason was the file name of my Freebase data dump. It was named
> freebase_latest.gz, and Jena TDB was trying to look up the extension of
> the file in a map of Lang objects (check line 61 in RdfResourceImporter).
> Once I renamed my Freebase dump to freebase.rdf.gz, Jena TDB started to
> import the data.
>
> After that it threw a RIOT exception, so I'm now running the fixit.pl
> tool on the dump. I will keep you updated on how the indexing process
> turns out.
>
> Thanks for the valuable tips on indexing.
>
> Thanks
> Amindri
>
>
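
To illustrate the file-name problem described in the quoted message, here is
a minimal sketch of an extension-based lookup. It is not the actual code in
RdfResourceImporter or Jena: the function name, the extension set and the
compression set are all hypothetical and only show why a name like
freebase_latest.gz yields no match while freebase.rdf.gz does.

```python
import os

# Hypothetical set of RDF file extensions; the real map of Lang objects
# used by RdfResourceImporter / Jena may contain different entries.
KNOWN_RDF_EXTENSIONS = {"rdf", "nt", "ttl", "nq", "n3", "owl"}

# Hypothetical set of compression suffixes that are skipped before the lookup.
COMPRESSION_EXTENSIONS = {"gz", "bz2", "zip"}


def recognized_rdf_extension(filename):
    """Return the RDF extension found in the file name, or None if there is none."""
    parts = os.path.basename(filename).lower().split(".")
    extensions = parts[1:]  # everything after the first dot

    # Ignore a trailing compression suffix such as ".gz".
    if extensions and extensions[-1] in COMPRESSION_EXTENSIONS:
        extensions = extensions[:-1]

    # Nothing left means the name carries no hint about the RDF syntax.
    if not extensions:
        return None

    candidate = extensions[-1]
    return candidate if candidate in KNOWN_RDF_EXTENSIONS else None


if __name__ == "__main__":
    for name in ("freebase_latest.gz", "freebase.rdf.gz"):
        print(name, "->", recognized_rdf_extension(name))
```

With this sketch, freebase_latest.gz has nothing left once ".gz" is dropped,
so the lookup returns None, whereas freebase.rdf.gz resolves to "rdf".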


-- 
Regards
Amindri Udugala
