Hi Rupert,

I started indexing the Freebase dump 5 days ago, and everything seems to be fine when I check the logs. Here are some of the log lines I got:

11:32:58,216 [Thread-3] INFO jenatdb.RdfResourceImporter - Filtered: 1423200000 triples (80.25462627530949%)
11:33:27,960 [Thread-3] WARN jena.riot - [line: 1773380405, col: 129] Lexical form 'T17:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:33:27,960 [Thread-3] WARN jena.riot - [line: 1773380406, col: 126] Lexical form 'T10:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:33:27,961 [Thread-3] WARN jena.riot - [line: 1773380407, col: 126] Lexical form 'T17:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:33:37,043 [Thread-3] WARN jena.riot - [line: 1773388185, col: 123] Lexical form 'T00:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:33:37,043 [Thread-3] WARN jena.riot - [line: 1773388186, col: 126] Lexical form 'T00:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:35:08,676 [Thread-3] WARN jena.riot - [line: 1773466822, col: 125] Lexical form 'T08:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:35:08,676 [Thread-3] WARN jena.riot - [line: 1773466823, col: 127] Lexical form 'T16:30' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:35:08,676 [Thread-3] WARN jena.riot - [line: 1773466825, col: 128] Lexical form 'T13:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:35:12,191 [Thread-3] INFO jenatdb.RdfResourceImporter - Add: 350,200,000 triples (Batch: 333 / Avg: 702)
11:36:16,585 [Thread-3] WARN jena.riot - [line: 1773516793, col: 125] Lexical form 'T08:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:36:16,585 [Thread-3] WARN jena.riot - [line: 1773516797, col: 127] Lexical form 'T19:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:36:16,585 [Thread-3] WARN jena.riot - [line: 1773516801, col: 126] Lexical form 'T08:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:36:16,585 [Thread-3] WARN jena.riot - [line: 1773516803, col: 128] Lexical form 'T18:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:36:16,586 [Thread-3] WARN jena.riot - [line: 1773516806, col: 129] Lexical form 'T19:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:36:16,586 [Thread-3] WARN jena.riot - [line: 1773516807, col: 123] Lexical form 'T08:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:36:16,586 [Thread-3] WARN jena.riot - [line: 1773516808, col: 125] Lexical form 'T09:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:36:16,586 [Thread-3] WARN jena.riot - [line: 1773516809, col: 123] Lexical form 'T08:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:36:22,545 [Thread-3] INFO jenatdb.RdfResourceImporter - Filtered: 1423300000 triples (80.25274111404639%)
11:37:41,460 [Thread-3] INFO jenatdb.RdfResourceImporter - Add: 350,250,000 triples (Batch: 334 / Avg: 702)
11:39:29,824 [Thread-3] DEBUG file.BlockAccessMapped - Segment: 706
11:39:38,981 [Thread-3] INFO jenatdb.RdfResourceImporter - Filtered: 1423400000 triples (80.25097751824737%)
11:40:14,344 [Thread-3] INFO jenatdb.RdfResourceImporter - Add: 350,300,000 triples (Batch: 327 / Avg: 702)
11:40:23,581 [Thread-3] DEBUG file.BlockAccessMapped - Segment: 1930
11:41:55,656 [Thread-3] INFO jenatdb.RdfResourceImporter - Add: 350,350,000 triples (Batch: 493 / Avg: 702)
11:41:58,858 [Thread-3] INFO jenatdb.RdfResourceImporter - Filtered: 1423500000 triples (80.2491098335781%)
11:42:04,305 [Thread-3] WARN jena.riot - [line: 1773858075, col: 98] Lexical form 'T15:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:42:04,325 [Thread-3] WARN jena.riot - [line: 1773858119, col: 100] Lexical form 'T18:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:42:04,364 [Thread-3] WARN jena.riot - [line: 1773858168, col: 98] Lexical form 'T13:30' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:42:04,364 [Thread-3] WARN jena.riot - [line: 1773858169, col: 100] Lexical form 'T00:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime
11:43:46,111 [Thread-3] WARN jena.riot - [line: 1773975044, col: 88] Bad IRI: <http://www.amazon.de:80/exec/obidos/ASIN/B00005V6S1> Code: 13/DEFAULT_PORT_SHOULD_BE_OMITTED in PORT: If the port is the default one for the scheme it should be omitted.
11:43:46,111 [Thread-3] WARN jena.riot - [line: 1773975044, col: 88] Bad IRI: <http://www.amazon.de:80/exec/obidos/ASIN/B00005V6S1> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name.

However, I noticed that the indexing process uses up to 14 GB of RAM and very little CPU (0% to 1%, mostly 0%). It also does not seem to use any disk space at all. Is this something to be worried about?

Thanks
Amindri

On 13 February 2015 at 17:16, Amindri Udugala <amindriudug...@gmail.com> wrote:

> Hi Rupert,
>
> The fix is in the indexing tool
> (entityhub/indexing/core/source/LineBasedEntityIterator.java). I created
> the issue and submitted the patch.
>
> Yes Rupert, the problem was that Jena TDB was not importing the Freebase
> dump. The reason was the file name of my Freebase data dump. It was named
> freebase_latest.gz, and JenaTDB was trying to map the extension of the
> file using a map of Lang objects (check line no 61 in RdfResourceImporter).
> Once I renamed my Freebase dump to freebase.rdf.gz, Jena TDB started to
> import the data.
>
> Then it threw a riot exception, and now I'm running the fixit.pl
> tool on the dump. I will keep you updated on how the indexing process
> turns out.
>
> Thanks for the valuable tips on indexing.
>
> Thanks
> Amindri
>
>

--
Regards
Amindri Udugala
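
For anyone hitting the same file-name problem described in the quoted message, here is a minimal Python sketch of how an extension-based lookup can fail for a name like freebase_latest.gz. It is only an illustration under assumptions: the table RDF_EXTENSIONS, the set COMPRESSION_EXTENSIONS, and the function guess_rdf_syntax are hypothetical names, and this is not the actual code in RdfResourceImporter or Jena.

```python
import os

# Hypothetical extension table for illustration only; the real mapping used by
# the indexing tool (a map of Jena Lang objects) may contain different entries.
RDF_EXTENSIONS = {
    "rdf": "RDF/XML",
    "nt": "N-Triples",
    "ttl": "Turtle",
    "nq": "N-Quads",
}

# Hypothetical set of compression suffixes that are ignored during lookup.
COMPRESSION_EXTENSIONS = {"gz", "bz2"}


def guess_rdf_syntax(file_name):
    """Return the RDF syntax implied by the file name, or None if unknown."""
    parts = os.path.basename(file_name).split(".")
    # Drop a trailing compression suffix such as ".gz" before the lookup.
    if len(parts) > 1 and parts[-1].lower() in COMPRESSION_EXTENSIONS:
        parts = parts[:-1]
    if len(parts) < 2:
        # No extension is left, so there is nothing to map to a syntax.
        return None
    return RDF_EXTENSIONS.get(parts[-1].lower())


if __name__ == "__main__":
    for name in ("freebase_latest.gz", "freebase.rdf.gz"):
        print(name, "->", guess_rdf_syntax(name))
```

In this sketch, freebase_latest.gz has no extension left once ".gz" is removed, so the lookup finds nothing, while freebase.rdf.gz still carries ".rdf" and resolves to an entry in the table.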