1) tdbstats outputs, as the name suggests, statistics about the database. It's not a dump. If you look at the file it gave you, you will see that it records things like "how many triples have a certain predicate". TDB can use this kind of information to optimize queries:
https://jena.apache.org/documentation/tdb/optimizer.html#statistics-rule-file

The utility that dumps tuples is tdbdump:
https://jena.apache.org/documentation/tdb/commands.html#tdbdump
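For concreteness, here is a rough sketch of how those two commands are typically invoked. The ./tdb location and the 2G heap are the same ones used further down this thread; the temporary and output file names (/tmp/stats.opt, dump.nq) are just placeholders:

    # Generate the statistics; writing to a temporary file and then moving it
    # into the database directory as stats.opt avoids a partially written file.
    JAVA_OPTS="-Xmx2G" tdbstats --loc=./tdb > /tmp/stats.opt
    mv /tmp/stats.opt ./tdb/stats.opt

    # Dump the actual contents of the store (N-Quads on stdout).
    tdbdump --loc=./tdb > dump.nq

Since only one process can use a TDB database at a time, both of these should be run while the indexer is not touching the database.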
2) tdbstats doesn't do anything to the database on which it is run.

3) It's possible that the Stanbol indexer code is using the from-scratch TDB bulk loader. That might be relevant because that loader sorts the tuples before loading them, in order to use fast loading methods, and that step could produce the kind of behavior you saw, but it doesn't sound all that likely: the database shouldn't have been building up before then. If the indexer is running now, you can do a post-mortem after it finishes, right?
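For what it's worth, the sort-based "from scratch" loader is exposed on the command line as tdbloader2 (Unix only); whether the Stanbol indexer actually goes through that code path is only a guess here. A minimal sketch, with placeholder input file names:

    # Builds a brand-new TDB database by externally sorting the tuples first;
    # it cannot add data to an existing database.
    tdbloader2 --loc ./tdb dump1.nt dump2.nt.gz

If something like this is in play, a long quiet period with little visible change in the database directory could simply be the sorting phase rather than index writing.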
---
A. Soroka
The University of Virginia Library

> On Apr 6, 2016, at 3:53 PM, Antero Duarte <a.fduar...@gmail.com> wrote:
>
> Hi,
>
> Right, as an update to my last message, I changed the heap size for
> tdbstats and redirected the output to a file, which has 1,005,822 lines
> (output of "wc -l"). Now, if this relates directly to the number of
> entities, I don't think it finished, because the initial size of the files
> in rdfdata/ was around 70GB.
>
> One interesting thing though: the first time I ran tdbstats, the size of
> the tdb directory increased... Not much and only once; it has been stable
> ever since. Because I never used tdbstats before, I don't know if this is
> the indexing tool showing it's still running or just tdbstats itself
> changing something in the directory.
>
> I restarted the indexing tool and the size of tdb seems to be increasing
> again, and the tool seems to be working well... Could it be that I just
> needed to clear whatever was in RAM because it is a huge index?
>
> I will see how it goes from now on, and if further problems arise I will
> ask again.
>
> Thank you for your help.
> Antero
>
> On Wed, 6 Apr 2016 at 20:04 A. Soroka <aj...@virginia.edu> wrote:
>
>> TDB command-line tools should respond to ordinary usage of the JAVA_OPTS
>> environment variable, so you should be able to do something like:
>>
>> JAVA_OPTS="-Xmx2G" tdbstats --loc=./tdb
>>
>> to run with a 2GB heap, for example.
>>
>> One thing to note: if you were able to access the TDB database at all
>> (which it seems you were, because your attempt to use tdbstats didn't
>> error out instantly), it means that no other process was using the
>> database, in particular that the indexer was not. Only one system process
>> can use a TDB database at a time. Now, that may or may not mean that the
>> indexer was actually done doing stuff with the database: I don't know
>> enough about how it works to know that, although the Stanbol devs on here
>> would. It just means that at the moment you tried to run tdbstats, the
>> indexer had relinquished control of the TDB database, at least for the
>> moment.
>>
>> ---
>> A. Soroka
>> The University of Virginia Library
>>
>>> On Apr 5, 2016, at 6:42 PM, Antero Duarte <a.fduar...@gmail.com> wrote:
>>>
>>> Fair enough... I just wanted to figure out whether it had actually
>>> stopped before stopping it, but I will do that tomorrow.
>>>
>>> I didn't change anything with the Jena command-line tools; to be honest,
>>> I never used them before, so I just googled what to do and how to do it,
>>> and the tdbstats command was the one that made the most sense. Is this a
>>> flag that is passed in to tdbstats?
>>>
>>> Thanks,
>>> Antero
>>>
>>> On Tue, 5 Apr 2016 8:12 pm A. Soroka, <aj...@virginia.edu> wrote:
>>>
>>>> I don't know whether you will be able to restart the indexer, but it
>>>> seems like you aren't getting much of anywhere, so you might as well
>>>> stop it, with the proviso that you might have to start again from
>>>> scratch. But YMMV.
>>>>
>>>> As for the Jena side of things: did you adjust the heap allocation when
>>>> you ran tdbstats?
>>>>
>>>> ---
>>>> A. Soroka
>>>> The University of Virginia Library
>>>>
>>>>> On Apr 5, 2016, at 12:45 PM, Antero Duarte <a.fduar...@gmail.com> wrote:
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> I downloaded Jena to use the command-line tools; what command should I
>>>>> run? I tried running `tdbstats --loc=./tdb` (from the resources dir),
>>>>> but after a while it threw an exception because it ran out of heap
>>>>> space. Is there another command that can be more useful and doesn't use
>>>>> as much memory? On the other hand, if I can stop the indexing tool, I
>>>>> can clear some memory to let tdbstats run.
>>>>>
>>>>> Regards,
>>>>> Antero
>>>>>
>>>>> On Tue, 5 Apr 2016 at 15:21 A. Soroka <aj...@virginia.edu> wrote:
>>>>>
>>>>>> Looks similar to something others have seen:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/STANBOL-1446
>>>>>>
>>>>>> which doesn't help you much, but might be a place to centralize the
>>>>>> answer to this question. I wouldn't think that a WARN-level message
>>>>>> would tag a condition so severe that indexing doesn't take place.
>>>>>> Perhaps it is something else.
>>>>>>
>>>>>> Can you use Jena's command-line tools to check how many entities have
>>>>>> actually been loaded into TDB vs. how many you expect? That might give
>>>>>> you a clue as to where indexing is hanging up (if it actually is).
>>>>>>
>>>>>> ---
>>>>>> A. Soroka
>>>>>> The University of Virginia Library
>>>>>>
>>>>>>> On Apr 5, 2016, at 7:59 AM, Antero Duarte <a.fduar...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hello there,
>>>>>>>
>>>>>>> I have been struggling with building indexes from generic RDF, and
>>>>>>> even with using the default configuration for more popular sources
>>>>>>> like dbpedia.
>>>>>>>
>>>>>>> I found an indexing tool online configured to index yago, at
>>>>>>> https://github.com/ChalithaUdara/Stanbol-Yago-Site.
>>>>>>>
>>>>>>> Everything seemed to be going well until it got into this loop:
>>>>>>>
>>>>>>> 11:17:26,546 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace 'http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping ignored!
>>>>>>> 11:17:26,546 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'condition' valid , namespace 'http://www.kinjal.com/condition:' invalid -> mapping ignored!
>>>>>>> 11:17:26,576 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'wimpo' valid , namespace 'http://rdfex.org/withImports?uri=' invalid -> mapping ignored!
>>>>>>> 12:17:26,856 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'nsogi' valid , namespace 'http://prefix.cc/nsogi:' invalid -> mapping ignored!
>>>>>>> 12:17:26,918 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbc' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
>>>>>>> 12:17:26,949 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'category' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
>>>>>>> 12:17:26,949 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'hgnc' valid , namespace 'http://bio2rdf.org/hgnc:' invalid -> mapping ignored!
>>>>>>> 12:17:26,950 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'chebi' valid , namespace 'http://bio2rdf.org/chebi:' invalid -> mapping ignored!
>>>>>>> 12:17:26,980 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbt' valid , namespace 'http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
>>>>>>> 12:17:26,980 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'pubmed' valid , namespace 'http://bio2rdf.org/pubmed_vocabulary:' invalid -> mapping ignored!
>>>>>>> 12:17:26,980 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbptmpl' valid , namespace 'http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
>>>>>>> 12:17:26,981 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbrc' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
>>>>>>> 12:17:26,981 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'call' valid , namespace 'http://webofcode.org/wfn/call:' invalid -> mapping ignored!
>>>>>>> 12:17:27,011 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbcat' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
>>>>>>> 12:17:27,011 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'bgcat' valid , namespace 'http://bg.dbpedia.org/resource/Категория:' invalid -> mapping ignored!
>>>>>>> 12:17:27,012 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace 'http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping ignored!
>>>>>>> 12:17:27,012 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'condition' valid , namespace 'http://www.kinjal.com/condition:' invalid -> mapping ignored!
>>>>>>> 12:17:27,042 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'wimpo' valid , namespace 'http://rdfex.org/withImports?uri=' invalid -> mapping ignored!
>>>>>>>
>>>>>>> It happened to me before with the dbpedia index, and at first I
>>>>>>> thought it was some problem with the RDF source, and since these
>>>>>>> messages are logged at WARN level, I simply ignored them. But after
>>>>>>> days, the indexing/tdb directory stayed the same size even though
>>>>>>> there are still files in the indexing/resources/rdfdata directory.
>>>>>>> Then I realised that these messages follow a pattern and are logged
>>>>>>> every hour, with precision to the second, which seems weird. Also,
>>>>>>> they are always the same messages.
>>>>>>> This led me to think that the indexing tool is stuck in a loop and
>>>>>>> that's why it is not moving any further. I think it is important to
>>>>>>> say that the one-hour time span between messages is the same for the
>>>>>>> dbpedia index and for the yago index, even though the yago index is
>>>>>>> much bigger.
>>>>>>>
>>>>>>> I have been constantly running `watch du * -s` in the resources
>>>>>>> directory to check for size changes, and nothing has changed for days.
>>>>>>>
>>>>>>> I don't know if this is some problem with the configuration, but
>>>>>>> since I didn't configure it myself, I assumed that what I got from
>>>>>>> github would be a working configuration for this specific index.
>>>>>>>
>>>>>>> I have a few questions related to this problem:
>>>>>>>
>>>>>>> 1) Is it safe to cancel the indexing tool and start again without
>>>>>>> changing what's in the rdfdata and imported directories? Could this
>>>>>>> help at all?
>>>>>>>
>>>>>>> 2) What could possibly be causing this problem?
>>>>>>>
>>>>>>> 3) Why is it looping and logging every hour (accurate to the second)?
>>>>>>>
>>>>>>> If there is any extra information I can provide that would help in
>>>>>>> understanding what the problem is here, tell me what it is and I will
>>>>>>> provide it.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Antero