Hi,

Right, as an update to my last message: I changed the heap size for tdbstats and redirected the output to a file, which has 1,005,822 lines (output of "wc -l"). Now, if this directly relates to the number of entities, I don't think it finished, because the initial size of the files in rdfdata/ was around 70GB.

One interesting thing though, the first time I ran tdbstats, the size of the tdb directory increased... Not much and only once, it has been stable ever since. Because I never used tdbstats before, I don't know if this is the indexing tool showing it's still running or just tdbstats itself changing something in the directory.

I restarted the indexing tool and the size of tdb seems to be increasing again and the tool seems to be working well... Could it be that I just needed to clear whatever was in RAM because it is a huge index?

I will see how it goes from now on and if further problems arise I will ask again.

Thank you for your help.

Antero

On Wed, 6 Apr 2016 at 20:04 A. Soroka <aj...@virginia.edu> wrote:

> TDB command-line tools should respond to ordinary usage of the JAVA_OPTS
> environment variable, so you should be able to do something like:
>
> JAVA_OPTS="-Xmx2G" tdbstats --loc=./tdb
>
> to run with a 2GB heap, for example.
>
> One thing to note is that if you were even able to access the TDB database
> at all (which it seems you were because your attempt to use tdbstats didn't
> error out instantly) it means that no other process was using the database,
> particularly that the indexer was not. Only one system process can use a
> TDB database at a time. Now, that may or may not mean that the indexer was
> actually done doing stuff with the database: I don't know enough about how
> it works to know that, although the Stanbol devs on here would. It just
> means that at the moment you tried to run tdbstats, the indexer had
> relinquished control of the TDB database, at least for the moment.
>
> ---
> A. Soroka
> The University of Virginia Library
>
> > On Apr 5, 2016, at 6:42 PM, Antero Duarte <a.fduar...@gmail.com> wrote:
> >
> > Fair enough... I just wanted to figure out if it is actually stopped before
> > stopping it, but I will do that tomorrow.
> >
> > I didn't change anything with jena cmd tools, to be honest, I never used it
> > before, so I just googled what to do and how to do it, and the tdbstats
> > command was the one that made more sense. Is this a flag that is passed in
> > to tdbstats?
> >
> > Thanks,
> > Antero
> >
> > On Tue, 5 Apr 2016 8:12 pm A. Soroka, <aj...@virginia.edu> wrote:
> >
> >> I don't know whether you will be able to restart the indexer, but it seems
> >> like you aren't getting much of anywhere so you might as well stop it with
> >> the proviso that you might have to start again from scratch. But YMMV.
> >>
> >> As for the Jena side of things: did you adjust the heap allocation when
> >> you ran tdbstats?
> >>
> >> ---
> >> A. Soroka
> >> The University of Virginia Library
> >>
> >>> On Apr 5, 2016, at 12:45 PM, Antero Duarte <a.fduar...@gmail.com> wrote:
> >>>
> >>> Thanks for your reply.
> >>>
> >>> I downloaded jena to use the command line tools, what command should I run?
> >>> I tried running `tdbstats --loc=./tdb` (from the resources dir) but after a
> >>> while it threw an exception because it ran out of heap space. Is there
> >>> another command that can be more useful and doesn't use as much memory? On
> >>> the other hand if I can stop the indexing tool, I can clear some memory to
> >>> let tdbstats run.
> >>>
> >>> Regards,
> >>> Antero
> >>>
> >>> On Tue, 5 Apr 2016 at 15:21 A. Soroka <aj...@virginia.edu> wrote:
> >>>
> >>>> Looks similar to something others have seen:
> >>>>
> >>>> https://issues.apache.org/jira/browse/STANBOL-1446
> >>>>
> >>>> which doesn't help you much, but might be a place to centralize the answer
> >>>> to this question. I wouldn't think that a WARN level message would tag a
> >>>> condition so severe that indexing doesn't take place. Perhaps it is
> >>>> something else.
> >>>>
> >>>> Can you use Jena's command-line tools to check and see how many entities
> >>>> have actually been loaded into TDB vs. how many you expect? That might give
> >>>> you a clue as to where indexing is hanging up (if it actually is).
> >>>>
> >>>> ---
> >>>> A. Soroka
> >>>> The University of Virginia Library
> >>>>
> >>>>> On Apr 5, 2016, at 7:59 AM, Antero Duarte <a.fduar...@gmail.com> wrote:
> >>>>>
> >>>>> Hello there,
> >>>>>
> >>>>> I have been struggling with building indexes from generic rdf and even
> >>>>> using default configuration for more popular sources like dbpedia.
> >>>>>
> >>>>> I found an indexing tool online configured to index yago, at
> >>>>> https://github.com/ChalithaUdara/Stanbol-Yago-Site.
> >>>>>
> >>>>> Everything seemed to be going well until it got into this loop:
> >>>>>
> >>>>> 11:17:26,546 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace 'http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping ignored!
> >>>>> 11:17:26,546 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'condition' valid , namespace 'http://www.kinjal.com/condition:' invalid -> mapping ignored!
> >>>>> 11:17:26,576 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'wimpo' valid , namespace 'http://rdfex.org/withImports?uri=' invalid -> mapping ignored!
> >>>>> 12:17:26,856 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'nsogi' valid , namespace 'http://prefix.cc/nsogi:' invalid -> mapping ignored!
> >>>>> 12:17:26,918 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbc' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >>>>> 12:17:26,949 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'category' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >>>>> 12:17:26,949 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'hgnc' valid , namespace 'http://bio2rdf.org/hgnc:' invalid -> mapping ignored!
> >>>>> 12:17:26,950 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'chebi' valid , namespace 'http://bio2rdf.org/chebi:' invalid -> mapping ignored!
> >>>>> 12:17:26,980 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbt' valid , namespace 'http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
> >>>>> 12:17:26,980 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'pubmed' valid , namespace 'http://bio2rdf.org/pubmed_vocabulary:' invalid -> mapping ignored!
> >>>>> 12:17:26,980 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbptmpl' valid , namespace 'http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
> >>>>> 12:17:26,981 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbrc' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >>>>> 12:17:26,981 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'call' valid , namespace 'http://webofcode.org/wfn/call:' invalid -> mapping ignored!
> >>>>> 12:17:27,011 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbcat' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >>>>> 12:17:27,011 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'bgcat' valid , namespace 'http://bg.dbpedia.org/resource/?????????:' invalid -> mapping ignored!
> >>>>> 12:17:27,012 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace 'http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping ignored!
> >>>>> 12:17:27,012 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'condition' valid , namespace 'http://www.kinjal.com/condition:' invalid -> mapping ignored!
> >>>>> 12:17:27,042 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'wimpo' valid , namespace 'http://rdfex.org/withImports?uri=' invalid -> mapping ignored!
> >>>>>
> >>>>> It happened to me before with the dbpedia index and at first I thought it
> >>>>> was some problem with the rdf source, and since these messages are logged
> >>>>> at WARN level, I simply ignored them. But after days, the indexing/tdb
> >>>>> directory stayed the same size even though there are still files in the
> >>>>> indexing/resources/rdfdata directory. Then I realised that these messages
> >>>>> follow a pattern and they are logged every hour with precision to the
> >>>>> second, which seems weird. Also, they are always the same messages. This
> >>>>> led me to think that the indexing tool is stuck in a loop and that's why it
> >>>>> is not moving any further. I think it is important to say that the one hour
> >>>>> time span between messages is the same for the dbpedia index and for the
> >>>>> yago index, although the yago index is much bigger.
> >>>>>
> >>>>> I have been constantly running `watch du * -s` in the resources directory
> >>>>> for days to check for size changes and nothing is changing and hasn't
> >>>>> changed for days.
> >>>>>
> >>>>> I don't know if this is some problem with the configuration, but since I
> >>>>> didn't configure it myself, I assumed that what I got from github would be
> >>>>> a working configuration for this specific index.
> >>>>>
> >>>>> I have a few questions related to this problem:
> >>>>>
> >>>>> 1) Is it safe to cancel the indexing tool and start again without changing
> >>>>> what's in the rdfdata and imported directories? Could this help at all?
> >>>>>
> >>>>> 2) What can possibly be causing this problem?
> >>>>>
> >>>>> 3) Why is it looping and logging every hour (accurate to the second)?
> >>>>>
> >>>>> If there is any extra information I can provide that would help
> >>>>> understanding what the problem is here, tell me what it is and I will
> >>>>> provide it.
> >>>>>
> >>>>> Regards,
> >>>>> Antero