Hi all,

On 26/03/15 03:07, Tim Donohue wrote:
> In DSpace 5, we obviously already have a basic version of a backup to
> CSV for statistics:
> https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-BackuporExportSOLRrecordstointermediateformat
>
> So, can we simply enhance that backup process so that it stores all the
> information we are capturing in our Statistics & Authority indexes, and
> build in a corresponding re-index (re-import) script?
>
> NOTE: Some of you probably realize this, but everything I've said above
> has essentially been done by Andrea Schweer in:
> * https://jira.duraspace.org/browse/DS-2486
> * https://github.com/DSpace/DSpace/pull/894/
>
> So, my main point here is that I feel a standard backup & restore
> process to/from CSV files may be a good enough solution to this Solr
> question. We just need to better document that as a *highly recommended*
> backup if you ever want to be able to restore or reindex your
> statistics/authority info.

Thanks for the shout-out Tim. I'll add a few more points based on what I've learned while writing the import/export code.

As Mark Wood pointed out elsewhere in this discussion, only stored fields can be exported. At the moment, all (relevant?) fields in the statistics Solr core are stored. I think the same applies to the authority core, but I'm not certain. So we really should check all PRs for changes to the schema for these cores and keep storing all fields that we may wish to export at some point down the track. We should also test that all related functionality still works after any such schema change, to avoid issues like breaking the sharding by introducing versioning, see https://jira.duraspace.org/browse/DS-2212
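One way to audit a schema change for this is to list the fields that are not explicitly stored. The following is only a minimal sketch: it assumes the classic schema.xml layout in which each `<field>` element carries its own `stored` attribute, and both the sample schema and the helper name `unstored_fields` are made up for illustration (this is not the real DSpace schema). Fields that inherit `stored` from their field type are not resolved here and would simply be flagged for manual review.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment for illustration only; not the real DSpace schema.
SAMPLE_SCHEMA = """
<schema name="example" version="1.5">
  <fields>
    <field name="uid" type="string" indexed="true" stored="true"/>
    <field name="time" type="date" indexed="true" stored="true"/>
    <field name="owningComm" type="int" indexed="true" stored="true" multiValued="true"/>
    <field name="searchText" type="text" indexed="true" stored="false"/>
  </fields>
</schema>
"""


def unstored_fields(schema_xml):
    """Return the names of <field> elements not explicitly marked stored="true".

    A field without a stored attribute is also returned, because this sketch
    does not look up the default defined on the field type.
    """
    root = ET.fromstring(schema_xml.strip())
    return [
        field.get("name")
        for field in root.iter("field")  # matches <field> only, not <dynamicField>
        if field.get("stored", "").lower() != "true"
    ]


not_exportable = unstored_fields(SAMPLE_SCHEMA)
print(not_exportable)
```

Running something like this against the schema in a PR branch would show at a glance which fields a CSV export could not recover.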

The CSV format used by Solr is tricky because export followed by import does not round-trip cleanly when there are multi-value fields. That is, if you do a CSV export (wt=csv) and then re-import that same CSV file without extra parameters, the data in multi-value fields will not be the same as before the export -- multiple values will be squashed into a single value. This affects e.g. the owning community information in the Solr statistics.

I do think we need to avoid hard-coding the fields because then we'd have to change the code every time the schema changes. My code inspects the schema and tells the update CSV handler which fields are multi-valued and hence should be split. This is a little bit ugly when there are multi-valued dynamic fields (the stock statistics schema has none, but the one for my repositories does, and so does the authority schema). Interrogating the schema can either go by
  • the declared schema -- this can be done on an empty core but leaves out the actual instances of dynamic fields; or
  • the actual data -- this gives us the actual instances of dynamic fields but has to be done on a core with all data present.
My code takes the second approach since for my repositories, losing the data in multi-valued dynamic fields wasn't an option. I'm feeling a little uneasy about it; maybe if the core is empty the code should fall back on the declared schema. Thoughts / code welcome.
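To make the splitting concrete, here is a minimal sketch of the per-field parameters that Solr's CSV update handler accepts (`f.<field>.split` and `f.<field>.separator`). The helper name `csv_import_params` and the field names are assumptions for illustration; this is not the code from my PR, and how the list of multi-valued fields is obtained (declared schema or actual data) is left open.

```python
from urllib.parse import urlencode


def csv_import_params(multi_valued_fields, separator=","):
    """Build request parameters for a Solr CSV import.

    For every multi-valued field, ask the CSV update handler to split the
    cell back into individual values instead of indexing it as one value.
    """
    params = {"commit": "true"}
    for name in sorted(multi_valued_fields):
        params["f.%s.split" % name] = "true"
        params["f.%s.separator" % name] = separator
    return params


# Field names are illustrative; use whatever the schema inspection reports.
params = csv_import_params(["owningComm", "owningColl"])
query_string = urlencode(params)
print(query_string)
```

The resulting query string would be appended to the CSV update URL of the core being restored. Keeping the field list as an input is what avoids hard-coding field names in the import code.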

The multi-value field issue might also put a limit on how useful the CSV is outside of the Solr import. However, to a human eye it's probably quite clear which fields are multi-valued, so munging the data in Excel etc. is probably still doable. In terms of importing the data automatically into something like ElasticSearch (or GA or whatever), this may also be a bit of a spanner in the works. On the other hand, it's nothing insurmountable, I think. And of course the way back is currently probably not possible either -- i.e., if someone is using the ES stats exclusively but wishes to switch over to the Solr-based stats later, I don't think they can dump out the data from ES and import it into Solr. For the statistics, I don't think the idea of a write-to-csv stats logger is so bad, though I would want to make sure that all 3 stats loggers (Solr, ES, write-to-csv) capture the same information. But that still leaves the authority core to be looked after.

The code in my PR around year shards is a bit ugly and I'm not 100% sure it will work. I don't know whether anyone thinks this is a showstopper type issue that needs to be resolved before the PR can be merged. Given that we haven't heard much about the sharding being broken since DSpace 3, I suspect this isn't widely used. Also, year shards currently don't work with remote Solr servers anyway (https://jira.duraspace.org/browse/DS-2521), so maybe this is a non-issue. Thoughts welcome.

Another gap in my code is incremental exports. At the moment, the export part of my code dumps all of the data. I think it would be nice for back-up purposes to be able to specify a start date from which to export, so that people can export e.g. monthly and back up this data. I left this out for now because a) I need to get on with upgrading my repositories and b) I wanted my code to be general enough to deal with both the usage stats and the authority data. The stats core has an obviously usable date field (time). The authority core doesn't really have one -- there is a creation date and a last-modified date; presumably for back-up purposes the last-modified date would be more useful. Again, thoughts / code welcome.
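For what it's worth, the start-date restriction itself could be as simple as a Solr range filter on the date field. A minimal sketch, assuming the export query accepts an extra filter query; the helper name `export_since_filter` is hypothetical, and `time` is the stats field mentioned above (the authority core would need its own date field passed in):

```python
from datetime import datetime, timezone


def export_since_filter(date_field, start):
    """Return a Solr range filter matching documents dated at or after start.

    start should be a timezone-aware datetime; it is converted to UTC
    because Solr date values are written in UTC with a trailing Z.
    """
    stamp = start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return "%s:[%s TO *]" % (date_field, stamp)


# Example: export everything in the stats core from 1 March 2015 onwards.
fq = export_since_filter("time", datetime(2015, 3, 1, tzinfo=timezone.utc))
print(fq)
```

The date field is a parameter rather than a constant so that the same code could serve both cores.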

I think re-importing without clearing the index first won't do any harm if all documents have a unique key, but I haven't tested this. Again, not sure whether this is a showstopper.

Anyway, even though there are still a few tricky spots, it looks like we're making progress. Thanks to everyone who has discussed this with me and/or tested my code! Now I hope we can sort out the last few open questions in a way that is useful for the majority of the DSpace user base.

cheers,
Andrea
-- 
Dr Andrea Schweer
IRR Technical Specialist, ITS Information Systems
The University of Waikato, Hamilton, New Zealand
_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
