Hi Rupert,

Thank you very much for the very informative reply. It seems the indexing tool is running as expected. I will keep you updated on how it goes.

Many thanks again.

Regards
Amindri

On 23 February 2015 at 18:50, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:

> Hi Amindri,
>
> You can ignore those WARNINGS. They simply tell you that a literal
> value failed to validate against the stated data type. I am not
> completely sure what Jena does with such triples, but I think it
> stores them in the triplestore anyway. When I imported Freebase I
> piped the log output to a grep that removed all lines containing
> "WARN jena.riot".
>
> You will also see some similar warnings during indexing (e.g. dates
> like the 31st of February ...). During indexing those data are stored
> as string values.
>
Yes I saw these as well...
>
> On Mon, Feb 23, 2015 at 3:06 AM, Amindri Udugala
> <amindriudug...@gmail.com> wrote:
> > However I noticed that the indexing process uses up to 14 GB of RAM and
> > very little CPU (0% - 1%, mostly 0%). It also does not seem to use any
> > disk space at all. Is this something to be worried about?
>
> Jena TDB uses memory-mapped files. AFAIK it will use all the memory it
> can get for those. CPU usage is expected to be minimal. Most of the
> time is spent in index lookups. For every triple Jena needs to look up
> the subject, predicate and object in the nodes table. After that it
> needs to look up the triple in the triple table.
> In case any node or the triple does not exist it needs to update the
> tables.
>
> So most of the time is spent in lookups and write operations. As soon
> as the tables get too big to be mapped in memory things start to get
> slow. Depending on the hardware even very slow ....
>
> The WARN messages state the line number. When you do a line count on
> the source file you can easily determine how much of the dump you have
> already imported. You should also see log messages about the current
> import speed. Combining this you can estimate the remaining time.
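
The estimate described above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the three inputs (the line number from the latest WARN message, the line count of the dump, and the import speed from the log) are read off by hand, and it assumes one triple per line and a constant import rate. Since the import slows down once the tables no longer fit in memory, the result is best treated as a lower bound.

```python
from datetime import timedelta


def progress_percent(current_line: int, total_lines: int) -> float:
    """Share of the dump already imported, in percent."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return 100.0 * min(current_line, total_lines) / total_lines


def estimate_remaining(current_line: int, total_lines: int,
                       lines_per_second: float) -> timedelta:
    """Remaining import time, assuming one triple per line and a constant rate."""
    if total_lines <= 0 or lines_per_second <= 0:
        raise ValueError("total_lines and lines_per_second must be positive")
    remaining_lines = max(total_lines - current_line, 0)
    return timedelta(seconds=remaining_lines / lines_per_second)


if __name__ == "__main__":
    # Made-up example values, not taken from a real import run.
    print(progress_percent(500, 2000))
    print(estimate_remaining(500, 1500, 10.0))
```

Re-running the calculation with a fresh line number and the latest reported speed every few hours shows how far the actual progress drifts from the first estimate.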
>
> best
> Rupert
>
> >
> > Thanks
> > Amindri
> >
> > On 13 February 2015 at 17:16, Amindri Udugala <amindriudug...@gmail.com>
> > wrote:
> >
> >> Hi Rupert,
> >>
> >> The fix is in the indexing tool
> >> (entityhub/indexing/core/source/LineBasedEntityIterator.java). I created
> >> the issue and submitted the patch.
> >>
> >> Yes Rupert, the problem was that Jena TDB was not importing the Freebase
> >> dump. The reason was the file name of my Freebase data dump. It was
> >> named freebase_latest.gz, and Jena TDB was trying to match the extension
> >> of the file against a map of Lang objects (check line 61 in
> >> RdfResourceImporter). Once I renamed my Freebase dump to
> >> freebase.rdf.gz, Jena TDB started to import the data.
> >>
> >> Then again it threw a RIOT exception, and now I'm running the fixit.pl
> >> tool on the dump. I will keep you updated on how the indexing process
> >> turns out.
> >>
> >> Thanks for the valuable tips on indexing.
> >>
> >> Thanks
> >> Amindri
> >
> > --
> > Regards
> > Amindri Udugala
>
> --
> | Rupert Westenthaler             rupert.westentha...@gmail.com
> | Bodenlehenstraße 11             ++43-699-11108907
> | A-5500 Bischofshofen
> | REDLINK.CO ..........................................................................
> | http://redlink.co/

--
Regards
Amindri Udugala