Hi Rupert,
I'm sorry, I should have clarified: I already deleted both the dist and
destination folders before reindexing; see below for the script I use. That
didn't work. I have now resorted to reinitializing the entire indexing setup,
reinstalling the jar, and rebuilding the index in /sling/indexes/default/...,
and now I am OK. But this is definitely something inside Stanbol, as I have
been diligently deleting those indexing folders.
Thanks,
Michel
# remove the previously deployed index archive
rm ../stanbol/sling/datafiles/TZW.solrindex.zip
sleep 5
cd TZW
# delete both indexing output folders before rebuilding
rm -rf indexing/destination
rm -rf indexing/dist
# rebuild the index
java -jar \
  org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar \
  index
# deploy the freshly built index archive
mv indexing/dist/TZW.solrindex.zip ../../stanbol/sling/datafiles
On 26 mrt. 2012, at 17:11, Rupert Westenthaler wrote:
> Hi Michel
> On 26.03.2012, at 16:40, Michel Benevento wrote:
>
>> Hello,
>>
>> As I am experimenting with various versions of my importfile I have changed
>> my namespace urls. But when I refresh the index, the old namespaces keep
>> accumulating in my results, resulting in duplicates. Is this intended
>> behavior? How can I get rid of these (cached?) results and return to a
>> pristine state?
>>
>
> I think I have an explanation for what you are seeing. Can you please check
> the following?
>
> The indexing tool does NOT delete the "{indexing-root}/indexing/destination"
> folder. So if you index your data twice without deleting this folder, the new
> data will be appended. This would explain why you still see the data with the
> old namespaces. So please try to delete the indexing/destination folder and
> index again.
>
> This behavior is not a bug but a feature, because it allows indexing multiple
> datasets. I am currently writing some documentation on that, so I will copy
> the related section to the end of this mail.
>
> best
> Rupert
>
> - - -
> ### Indexing Datasets separately
>
> This demo indexes all four datasets in a single step. However, this is not
> required. With a simple trick it is possible to index different datasets,
> each with its own indexing configuration, into the same target. This section
> describes how this can be achieved and why users might want to do it.
>
> This demo uses Solr as the target for the indexing process. In theory there
> could be several possibilities, but currently this is the only available
> IndexingDestination implementation. The SolrIndex used to store the data is
> located at "{indexing-root}/indexing/destination/indexes/default/{name}". If
> this directory does not already exist, it is initialized by the indexing tool
> based on the SolrCore configuration in
> "{indexing-root}/indexing/config/{name}", or on the default SolrCore
> configuration if none is present. However, if it already exists, this core is
> used and the data of the current indexing process are added to the existing
> SolrCore.
>
> Because of that, it is possible to subsequently add information from
> different datasets to the same SolrIndex. However, users need to be aware
> that if different datasets contain the same entity (a resource with the same
> URI), the information from the second dataset will replace that of the
> first. Nonetheless, in the given demo this would allow creating separate
> configurations (e.g. mappings) for all four datasets while still ensuring
> the indexed data end up in the same SolrIndex.
>
> This might be useful in situations where the same property (e.g. rdfs:label)
> is used in different ways by the different datasets. One could then create a
> mapping for dataset1 that maps rdfs:label > skos:prefLabel, and for dataset2
> a mapping that ensures rdfs:label > skos:altLabel.
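> For instance (a sketch only; the exact file name depends on the indexing
> tool's configuration, commonly indexing/config/mappings.txt), dataset1 could
> use a mapping entry such as
>
>     rdfs:label > skos:prefLabel
>
> while dataset2 uses
>
>     rdfs:label > skos:altLabel
>
> so the labels of each dataset end up in the intended fields of the shared
> index.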
>
> Workflows like that can easily be implemented with shell scripts or by
> setting soft links in the file system.
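> As a sketch of such a symlink-based workflow (the directory names dataset1,
> dataset2 and the shared destination path below are assumptions, not part of
> the demo), each dataset root can point its indexing/destination at one
> shared folder:

```shell
#!/bin/sh
set -e

# Hypothetical layout: one indexing root per dataset, all sharing a single
# IndexingDestination through symbolic links.
mkdir -p shared/destination dataset1/indexing dataset2/indexing

# Point both datasets' indexing/destination at the shared folder.
ln -sfn "$PWD/shared/destination" dataset1/indexing/destination
ln -sfn "$PWD/shared/destination" dataset2/indexing/destination

# Each dataset keeps its own indexing/config (e.g. different mappings).
# The indexing tool is then run once per dataset root; the second run
# re-uses the SolrCore created by the first and adds its data to it:
#   (cd dataset1 && java -jar <indexing-tool>.jar index)
#   (cd dataset2 && java -jar <indexing-tool>.jar index)
```

> Deleting only shared/destination then resets the combined index for all
> datasets at once.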