Hi Michel,

On Tue, Mar 27, 2012 at 9:55 AM, Michel Benevento <[email protected]> wrote:
> Success!
>
> resources/tdb was the culprit, thank you Rupert.
>
good to hear

> PS Maybe it should be a setting in indexing.properties(?) if you want to
> override or append to an index?
>

That would be another possibility to solve this issue if the named graph
approach does not work out. I prefer the named graph solution because it
would work "magically" - without the user needing to provide any kind of
configuration. However, a property like that would be a good idea for
enabling/disabling the automatic deletion of the destination folder.

best
Rupert

>
>
> On 27 mrt. 2012, at 09:38, Rupert Westenthaler wrote:
>
>> Hi Michel
>>
>> Can you please try the following
>>
>> On Mon, Mar 26, 2012 at 5:51 PM, Michel Benevento <[email protected]>
>> wrote:
>>
>>> rm ../stanbol/sling/datafiles/TZW.solrindex.zip
>>> sleep 5
>>> cd TZW
>>> rm -rf indexing/destination
>>> rm -rf indexing/dist
>>
>> rm -rf indexing/resource/tdb
>>
>>> java -jar
>>> org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar
>>> index
>>> mv indexing/dist/TZW.solrindex.zip ../../stanbol/sling/datafiles
>>>
>>
>> The "indexing/resource/tdb" folder contains the Jena TDB triple store
>> with the imported RDF data. This data is kept in-between indexing
>> processes, mainly because the time needed to import the RDF data is
>> typically about the same as the time needed for the indexing process
>> itself. Because of that it makes a lot of sense to reuse already
>> imported RDF data when indexing RDF dumps (e.g. DBpedia).
>>
>> In cases where the RDF data change, this default is not optimal,
>> because the changed dataset is appended to the data already present in
>> the Jena TDB store. This means that if you change or remove things in
>> your thesaurus, they will still be present within the triple store and
>> therefore also appear in the created index.
>>
>> I must say that it is very confusing that users need to delete
>> something within the "/indexing/resources" folder if they change the
>> RDF data.
>> So I will create an issue to change this behavior. I think I will try
>> to create a named graph for each imported RDF file. This would make it
>> possible to automatically delete already existing data within the Jena
>> TDB store when a file with the same name is imported again.
>>
>> Can you please check and report back whether this is the cause of your
>> problem.
>>
>> Thanks in advance
>>
>> best
>> Rupert
>>
>>>
>>> On 26 mrt. 2012, at 17:11, Rupert Westenthaler wrote:
>>>
>>>> Hi Michel
>>>>
>>>> On 26.03.2012, at 16:40, Michel Benevento wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> As I am experimenting with various versions of my import file I have
>>>>> changed my namespace URLs. But when I refresh the index, the old
>>>>> namespaces keep accumulating in my results, resulting in duplicates.
>>>>> Is this intended behavior? How can I get rid of these (cached?)
>>>>> results and return to a pristine state?
>>>>>
>>>>
>>>> I think I have an explanation for what you are seeing. Can you please
>>>> check the following.
>>>>
>>>> The indexing tool does NOT delete the
>>>> "{indexing-root}/indexing/destination" folder. So if you index your
>>>> data twice without deleting this folder, the new data will be
>>>> appended. This would explain why you still see the data with the old
>>>> namespaces. So please try to delete the indexing/destination folder
>>>> and index again.
>>>>
>>>> This behavior is not a bug but a feature, because it allows indexing
>>>> multiple datasets. I am currently writing some documentation on that,
>>>> so I will copy the related section to the end of this mail.
>>>>
>>>> best
>>>> Rupert
>>>>
>>>> - - -
>>>> ### Indexing Datasets separately
>>>>
>>>> This demo indexes all four datasets in a single step. However this is
>>>> not required. With a simple trick it is possible to index different
>>>> datasets with different indexing configurations to the same target.
>>>> This section describes how this can be achieved and why users might
>>>> want to do it.
>>>>
>>>> This demo uses Solr as the target for the indexing process.
>>>> Theoretically there could be several possibilities, but currently
>>>> this is the only available IndexingDestination implementation. The
>>>> SolrIndex used to store the data is located at
>>>> "{indexing-root}/indexing/destination/indexes/default/{name}". If
>>>> this directory does not already exist it is initialized by the
>>>> indexing tool based on the SolrCore configuration in
>>>> "{indexing-root}/indexing/config/{name}", or on the default SolrCore
>>>> configuration if none is present. However, if it already exists then
>>>> this core is used and the data of the current indexing process are
>>>> added to the existing SolrCore.
>>>>
>>>> Because of that it is possible to subsequently add information from
>>>> different datasets to the same SolrIndex. However, users need to
>>>> know that if the different datasets contain the same entity (a
>>>> resource with the same URI), the information of the second dataset
>>>> will replace that of the first. Nonetheless, in the given demo this
>>>> would allow creating separate configurations (e.g. mappings) for all
>>>> four datasets while still ensuring the indexed data end up in the
>>>> same SolrIndex.
>>>>
>>>> This might be useful in situations where the same property (e.g.
>>>> rdfs:label) is used by the different datasets in different ways. One
>>>> could then create a mapping for dataset1 that maps rdfs:label >
>>>> skos:prefLabel and for dataset2 a mapping that ensures rdfs:label >
>>>> skos:altLabel.
>>>>
>>>> Workflows like that can easily be implemented with shell scripts or
>>>> by setting soft links in the file system.
>>>
>>
>>
>>
>> --
>> | Rupert Westenthaler             [email protected]
>> | Bodenlehenstraße 11             ++43-699-11108907
>> | A-5500 Bischofshofen
>

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen
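The cleanup steps Rupert lists earlier in the thread can be collected into one small helper. A minimal sketch, assuming the TZW example layout from this thread; the `clean_index_state` function name is mine, not part of the indexing tool:

```shell
#!/bin/sh
# Remove the state the Entityhub indexing tool keeps between runs, so the
# next "java -jar ... index" invocation starts from a pristine state.
clean_index_state() {
    root="$1"                             # indexing root, e.g. ./TZW
    rm -rf "$root/indexing/destination"   # the target SolrIndex
    rm -rf "$root/indexing/dist"          # packaged {name}.solrindex.zip
    rm -rf "$root/indexing/resource/tdb"  # the cached Jena TDB triple store
}
```

After running it, re-run the indexing jar and copy the fresh `{name}.solrindex.zip` to the Stanbol datafiles folder as in the commands quoted above.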
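The named-graph idea Rupert proposes could work by deriving a stable graph URI from each imported file's name, so that re-importing the same file first drops its old graph instead of appending. A sketch of that derivation and of the SPARQL Update a re-import would issue; the `urn:x-stanbol-indexing:` prefix and both helper names are assumptions, not the actual implementation:

```shell
#!/bin/sh
# Derive a stable named-graph URI from an imported RDF file's name, so a
# re-import of e.g. "thesaurus.nt" can replace its old triples in the
# Jena TDB store rather than appending to them.
graph_uri_for() {
    printf 'urn:x-stanbol-indexing:%s' "$(basename "$1")"
}

# SPARQL Update a re-import would run against the store before loading.
drop_update_for() {
    printf 'DROP SILENT GRAPH <%s> ;\n' "$(graph_uri_for "$1")"
}
```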
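The "indexing datasets separately" workflow from the quoted documentation section can be sketched with the soft-link trick it mentions: switch the active configuration per run while the shared destination folder stays in place. Directory names like `config.dataset1` are assumptions for illustration:

```shell
#!/bin/sh
# Point indexing/config at a per-dataset configuration, then run the
# indexing tool; because indexing/destination is reused, each run adds
# its dataset to the same SolrIndex.
index_dataset() {
    root="$1"
    dataset="$2"
    # -n replaces the symlink itself instead of descending into it
    ln -sfn "config.$dataset" "$root/indexing/config"
    # (cd "$root" && java -jar \
    #   org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar index)
}
```

Calling `index_dataset ./TZW dataset1` and then `index_dataset ./TZW dataset2` would index both datasets with their own mappings into one SolrIndex, per the semantics described above.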
