Hi Rupert,
I'm sorry, I should have clarified: I already deleted both the dist and
destination folders before reindexing; see below for the script I use. That
didn't work. I have now resorted to reinitializing the entire indexing setup,
reinstalling the jar, and rebuilding the index in /sling/indexes/default/...,
and now I am OK. But this is definitely something inside Stanbol, as I have
been diligently deleting those indexing folders.
Thanks,
Michel
# remove the previously deployed index archive
rm ../stanbol/sling/datafiles/TZW.solrindex.zip
sleep 5
cd TZW
# delete both indexing output folders before rebuilding
rm -rf indexing/destination
rm -rf indexing/dist
# rebuild the index
java -jar \
  org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar \
  index
# deploy the freshly built index archive
mv indexing/dist/TZW.solrindex.zip ../../stanbol/sling/datafiles
On 26 mrt. 2012, at 17:11, Rupert Westenthaler wrote:
> Hi Michel
> On 26.03.2012, at 16:40, Michel Benevento wrote:
>
>> Hello,
>>
>> As I am experimenting with various versions of my importfile I have changed
>> my namespace urls. But when I refresh the index, the old namespaces keep
>> accumulating in my results, resulting in duplicates. Is this intended
>> behavior? How can I get rid of these (cached?) results and return to a
>> pristine state?
>>
>
> I think I have an explanation for what you are seeing. Can you please check
> the following?
>
> The indexing tool does NOT delete the "{indexing-root}/indexing/destination"
> folder. So if you index your data twice without deleting this folder, the new
> data will be appended. This would explain why you still see the data with the
> old namespaces. So please try to delete the indexing/destination folder and
> index again.
>
> This behavior is not a bug but a feature, because it allows indexing multiple
> datasets. I am currently writing some documentation on that, so I will copy
> the related section to the end of this mail.
>
> best
> Rupert
>
> - - -
> ### Indexing Datasets separately
>
> This demo indexes all four datasets in a single step. However, this is not
> required. With a simple trick it is possible to index different datasets,
> each with its own indexing configuration, into the same target. This section
> describes how this can be achieved and why users might want to do it.
>
> This demo uses Solr as the target for the indexing process. In theory there
> could be several possibilities, but currently this is the only available
> IndexingDestination implementation. The SolrIndex used to store the data is
> located at "{indexing-root}/indexing/destination/indexes/default/{name}". If
> this directory does not already exist, it is initialized by the indexing tool
> based on the SolrCore configuration in
> "{indexing-root}/indexing/config/{name}", or on the default SolrCore
> configuration if none is present. However, if it already exists, this core is
> used and the data of the current indexing process are added to the existing
> SolrCore.
>
> Because of that, it is possible to subsequently add information from
> different datasets to the same SolrIndex. However, users need to be aware
> that if different datasets contain the same entity (a resource with the same
> URI), the information from the second dataset will replace that of the
> first. Nonetheless, in the given demo this would allow creating separate
> configurations (e.g. mappings) for all four datasets while still ensuring
> the indexed data end up in the same SolrIndex.
>
> This might be useful in situations where the same property (e.g. rdfs:label)
> is used in different ways by the different datasets. One could then create a
> mapping for dataset1 that maps rdfs:label > skos:prefLabel, and for dataset2
> a mapping that ensures rdfs:label > skos:altLabel.
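> For instance (a sketch only; the exact file name depends on the indexing
> tool's configuration, commonly indexing/config/mappings.txt), dataset1 could
> use a mapping entry such as
>
>     rdfs:label > skos:prefLabel
>
> while dataset2 uses
>
>     rdfs:label > skos:altLabel
>
> so the labels of each dataset end up in the intended fields of the shared
> index.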
>
> Workflows like that can easily be implemented with shell scripts or by
> setting soft links in the file system.
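> As a sketch of such a symlink-based workflow (the directory names dataset1,
> dataset2 and the shared destination path below are assumptions, not part of
> the demo), each dataset root can point its indexing/destination at one
> shared folder:

```shell
#!/bin/sh
set -e

# Hypothetical layout: one indexing root per dataset, all sharing a single
# IndexingDestination through symbolic links.
mkdir -p shared/destination dataset1/indexing dataset2/indexing

# Point both datasets' indexing/destination at the shared folder.
ln -sfn "$PWD/shared/destination" dataset1/indexing/destination
ln -sfn "$PWD/shared/destination" dataset2/indexing/destination

# Each dataset keeps its own indexing/config (e.g. different mappings).
# The indexing tool is then run once per dataset root; the second run
# re-uses the SolrCore created by the first and adds its data to it:
#   (cd dataset1 && java -jar <indexing-tool>.jar index)
#   (cd dataset2 && java -jar <indexing-tool>.jar index)
```

> Deleting only shared/destination then resets the combined index for all
> datasets at once.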